Senior Kotlin/Java Engineer · Synthesized.io (confirmed) · Core Data Engine · via NEWHR
Product: E2E Test Data Management platform (TDK — Test Data Kit). Three core features: data masking (replace PII with realistic fake data), synthetic data generation (create production-like data from scratch), database subsetting (extract consistent smaller copy of production DB).
Clients: Deutsche Bank, UBS, European Commission, Accenture, Clarity AI. Enterprise-grade, used in banking, healthcare, telecom.
Funding: Series A. Investors: IQ Capital, Seedcamp, Deutsche Bank. Offices: London (Shoreditch), New York, Alpharetta GA.
Supported databases: PostgreSQL, MySQL, Oracle, SQL Server, SAP HANA, DB2, MariaDB, Salesforce, Snowflake, BigQuery, Redshift. ~15 total. Deployed via Docker/Kubernetes/CLI. CI/CD integration (Jenkins, GitLab, GitHub Actions).
Config approach: "Data as Code" — YAML config files define masking rules, generation schemas, subsetting criteria. Deterministic, version-controlled, auditable.
Your role (core team): Build the data engine that connects to customer databases, reads schemas via JDBC, analyzes FK relationships, and executes masking/generation/subsetting. Handle: TB-scale data volumes, 500+ table schemas, cyclical FK references, cross-database type mapping, JVM performance optimization.
Key docs concepts to know: Transformers (masking/generation functions), Virtual Foreign Keys (user-defined FKs not in schema), CycleResolutionStrategy (handling circular references), Workers (parallel processing units), PII auto-detection.
STRONG Java backend — 4 years Java (Amazon, Alfa-Bank). Kotlin at Amazon.
STRONG Concurrency & multithreading — Moscow Exchange trading (microsecond latency), Amazon distributed systems.
STRONG High-load / low-latency — Moscow Exchange (trading system architecture), Alfa-Bank (trading infrastructure).
STRONG Relational databases — 15 years SQL at Alfa-Bank. PostgreSQL, MS SQL Server, Firebird.
STRONG Large data volumes — Moscow Exchange market data, Alfa-Bank trading history.
STRONG Algorithms & data structures — 25 years + interview prep with 120 quiz questions.
STRONG Open-ended research problems — System Architect role at Moscow Exchange.
PREP JVM performance tuning — GC tuning, profiling, memory optimization. You've done it but need to articulate specific techniques.
PREP Database internals — how PostgreSQL/MySQL store data, MVCC, query planning, JDBC internals. Beyond just writing SQL.
PREP Data masking/synthetic data concepts — referential integrity, format-preserving encryption, statistical distributions.
NICE ClickHouse — column-oriented database. Read basics for the interview.
1. HR screening (NEWHR) — Russian, 30 min. Background, motivation, salary, English check.
2. Technical screening — 60-90 min. JVM internals, concurrency, database knowledge, algorithm problem.
3. System design / Deep dive — "Design a data masking engine for a 500-table PostgreSQL database." Architecture, performance, trade-offs.
4. Culture fit / Founder interview — Startup values, ownership, ambiguity tolerance.
The job says: "Improve JVM performance: concurrency, memory usage, latency, throughput"
Step 1: Define "slow" — high latency (p99 > threshold)? Low throughput? High CPU? High memory? GC pauses?
Step 2: Choose the right profiler:
CPU profiling: async-profiler (sampling, low overhead, production-safe) or JFR (Java Flight Recorder, built into JVM). Find which methods consume CPU time. Look for: hot loops, inefficient algorithms, excessive object creation.
Memory profiling: JFR + jcmd for allocation profiling. heap dump + Eclipse MAT for leak analysis. Look for: large object retention, growing collections, unclosed resources.
GC analysis: enable GC logs (-Xlog:gc*), analyze with GCViewer or GCEasy. Look for: long STW pauses, frequent Full GC, promotion failures.
Latency: distributed tracing (OpenTelemetry) for per-request breakdown. Which service/DB call is the bottleneck?
Step 3: Typical findings for a data engine:
Problem: Processing 100M rows takes 30 minutes.
Profile shows:
  40% — GC pauses (too many temporary objects)
  30% — JDBC ResultSet processing (row-by-row instead of batch)
  20% — String allocations (building SQL strings)
  10% — actual data transformation logic
Fixes:
  1. Object pooling / flyweight for repeated values
  2. Batch JDBC fetches (fetchSize = 10000)
  3. StringBuilder / prepared statements
  4. Off-heap buffers for large datasets
The problem: processing millions of rows creates millions of temporary objects. GC spends more time collecting than the app spends processing.
Technique 1: Reduce allocations
// BAD — new String per row:
while (resultSet.next()) {
    val masked = "***" + resultSet.getString("email").substringAfter("@")
    // Creates: substring String, concatenation String, intermediate CharArray
    // 10M rows = 30M String objects = GC nightmare
}
// GOOD — reuse StringBuilder:
val sb = StringBuilder(256)
while (resultSet.next()) {
    sb.clear()
    val email = resultSet.getString("email")
    sb.append("***").append(email, email.indexOf('@') + 1, email.length)
    val masked = sb.toString() // one allocation per row
}
Technique 2: Primitive arrays instead of boxed collections
// BAD: List<Int> = List<Integer> on JVM. Each Integer = 16 bytes object.
val ids: List<Int> = rows.map { it.getInt("id") } // 10M Integer objects
// GOOD: IntArray = int[] on JVM. No boxing.
val ids = IntArray(rowCount) // 10M * 4 bytes = 40MB
var i = 0
while (resultSet.next()) { ids[i++] = resultSet.getInt("id") }
Technique 3: Off-heap memory (DirectByteBuffer)
// Store large datasets outside GC-managed heap:
val buffer = ByteBuffer.allocateDirect(1024 * 1024 * 100) // 100MB off-heap
// GC doesn't scan this memory. No pauses.
// Must manage lifecycle manually (like C/C++ malloc/free)
Technique 4: Object pooling
// Reuse objects instead of creating new ones.
// Note: pollFirst/addLast are java.util.ArrayDeque methods —
// Kotlin's own kotlin.collections.ArrayDeque doesn't have pollFirst:
val rowPool = java.util.ArrayDeque<MutableRow>(10_000)
fun getRow(): MutableRow = rowPool.pollFirst() ?: MutableRow()
fun returnRow(row: MutableRow) { row.clear(); rowPool.addLast(row) }
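The pool above assumes a single thread. If multiple workers share the pool, a lock-free deque can be swapped in; a minimal Java sketch (the `RowPool` name and the Supplier-based factory are illustrative, not product API):

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.function.Supplier;

// Thread-safe object pool: reuse when possible, allocate otherwise.
class RowPool<T> {
    private final ConcurrentLinkedDeque<T> pool = new ConcurrentLinkedDeque<>();
    private final Supplier<T> factory;

    RowPool(Supplier<T> factory) { this.factory = factory; }

    T acquire() {
        T obj = pool.pollFirst();                 // reuse if available
        return obj != null ? obj : factory.get(); // otherwise allocate
    }

    void release(T obj) {
        pool.addLast(obj); // caller must reset/clear the object first
    }
}
```

Unbounded ConcurrentLinkedDeque keeps the sketch short; a production pool would cap its size so a burst doesn't pin memory forever.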
Technique 5: Choose the right GC
For a data engine processing batch data: G1 GC with tuned region size. For latency-sensitive API layer: ZGC. For maximum throughput batch processing: Parallel GC.
Default behavior: PostgreSQL JDBC loads the ENTIRE result set into memory before returning the first row. Query returns 50M rows = 50M rows in JVM heap = OutOfMemoryError.
// DEFAULT — loads everything into memory:
val stmt = connection.createStatement()
val rs = stmt.executeQuery("SELECT * FROM big_table") // 50M rows in memory!
// WITH fetchSize — streaming, N rows at a time:
connection.autoCommit = false // required for PostgreSQL streaming
val stmt = connection.createStatement()
stmt.fetchSize = 10000 // fetch 10,000 rows at a time
val rs = stmt.executeQuery("SELECT * FROM big_table")
// Only 10,000 rows in memory. Next batch fetched when needed.
Why autoCommit = false is required: PostgreSQL uses server-side cursors for streaming. Cursors only work within a transaction. With autoCommit = true, each statement is its own transaction = no cursor = full result loaded.
MySQL difference: MySQL uses ResultSet.TYPE_FORWARD_ONLY + fetchSize = Integer.MIN_VALUE for streaming. Different driver, different API.
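The vendor difference can be captured in a small helper; a sketch (the vendor strings and the 10,000 PostgreSQL default are illustrative choices, not driver constants):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class FetchSizes {
    // MySQL Connector/J signals row-by-row streaming with Integer.MIN_VALUE;
    // PostgreSQL streams with any positive fetchSize (plus autoCommit = false).
    static int streamingFetchSize(String vendor) {
        return vendor.equalsIgnoreCase("mysql") ? Integer.MIN_VALUE : 10_000;
    }

    // Apply streaming settings to a connection/statement pair:
    static void configureStreaming(Connection conn, Statement stmt, String vendor)
            throws SQLException {
        if (!vendor.equalsIgnoreCase("mysql")) {
            conn.setAutoCommit(false); // PostgreSQL: server-side cursor needs a transaction
        }
        stmt.setFetchSize(streamingFetchSize(vendor));
    }
}
```

A real engine would hide this behind a per-database dialect layer rather than string comparisons.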
For a data engine this is CRITICAL: you're scanning entire databases (hundreds of GB). Without streaming, you can't even start processing. fetchSize controls memory usage vs network round trips — tune it based on row size and available heap.
# Memory:
-Xms4g -Xmx4g                          # fixed heap, no resize overhead
-XX:MaxDirectMemorySize=2g             # for off-heap buffers

# GC (for batch processing — throughput over latency):
-XX:+UseG1GC                           # or -XX:+UseParallelGC for pure batch
-XX:MaxGCPauseMillis=200               # G1 target pause time
-XX:G1HeapRegionSize=16m               # larger regions for large heap
-XX:InitiatingHeapOccupancyPercent=45  # trigger concurrent GC earlier

# GC logging:
-Xlog:gc*:file=gc.log:time,level,tags:filecount=10,filesize=10m

# Performance:
-XX:+UseStringDeduplication            # G1 deduplicates identical strings in heap
-XX:+OptimizeStringConcat              # optimize StringBuilder chains
-XX:+UseCompressedOops                 # 32-bit object references in <32GB heap

# Diagnostics:
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/heapdump.hprof
-XX:+FlightRecorder                    # enable JFR for profiling
For data engine specifically: -XX:+UseStringDeduplication is huge — if you're processing database dumps, many column values are repeated strings (country names, statuses, types). G1 deduplicates them in memory automatically.
This role emphasizes raw Java concurrency more than Kotlin coroutines
Approach 1: Partition by primary key ranges
// Split into N chunks by ID range:
val totalRows = 100_000_000
val chunkSize = 1_000_000 // 1M rows per chunk
// Assumes roughly dense, increasing ids; for sparse keys derive ranges
// from MIN(id)/MAX(id) or use keyset pagination.
val chunks = (0 until totalRows step chunkSize).map { offset ->
    "SELECT * FROM users WHERE id BETWEEN $offset AND ${offset + chunkSize - 1}"
}
// Process chunks in parallel:
val executor = Executors.newFixedThreadPool(8)
val futures = chunks.map { sql ->
executor.submit<ProcessingResult> {
processChunk(sql) // each chunk on its own JDBC connection
}
}
val results = futures.map { it.get() } // wait for all
executor.shutdown() // release the pool threads
Approach 2: Producer-consumer with BlockingQueue
// One reader thread (fast I/O):
val queue = ArrayBlockingQueue<Row>(10_000)
val reader = Thread {
val rs = stmt.executeQuery("SELECT * FROM users")
while (rs.next()) {
queue.put(Row.from(rs)) // blocks if queue full (backpressure!)
}
queue.put(POISON_PILL) // signal end
}
// Multiple processor threads (CPU-intensive):
val processors = (1..8).map {
    Thread {
        while (true) {
            val row = queue.take() // blocks if queue empty
            if (row === POISON_PILL) break
            maskAndWrite(row)
        }
    }
}
reader.start()
processors.forEach { it.start() }
// Gotcha: a single poison pill stops only ONE of the 8 consumers —
// enqueue 8 pills, or have each consumer re-enqueue the pill before exiting.
Approach 3: Kotlin coroutines with Channel (modern)
// Inside a coroutineScope { ... }:
val channel = Channel<Row>(capacity = 10_000)
// Producer:
launch(Dispatchers.IO) {
    while (resultSet.next()) { channel.send(Row.from(resultSet)) }
    channel.close() // closing ends every consumer's for-loop — no poison pill needed
}
// N consumers:
repeat(8) {
launch(Dispatchers.Default) {
for (row in channel) { maskAndWrite(row) }
}
}
Key consideration: each approach needs its own JDBC connection for reading/writing. Connection pool size must match parallelism level. Database might be the bottleneck, not the JVM.
ForkJoinPool is designed for divide-and-conquer parallelism with work-stealing. Each thread has a deque of tasks. When a thread finishes its tasks, it steals from other threads' deques.
// Recursive task — process a large dataset by splitting:
class MaskTask(val data: List<Row>, val threshold: Int = 1000)
: RecursiveAction() {
override fun compute() {
if (data.size <= threshold) {
// Small enough — process directly
data.forEach { maskRow(it) }
} else {
// Split in half, process both in parallel
val mid = data.size / 2
val left = MaskTask(data.subList(0, mid))
val right = MaskTask(data.subList(mid, data.size))
invokeAll(left, right) // fork both, join when done
}
}
}
ForkJoinPool(8).invoke(MaskTask(allRows))
When to use: CPU-bound parallel processing of large in-memory datasets. Not for I/O. Java parallel streams use ForkJoinPool internally.
Work-stealing advantage: if one thread's partition finishes early (data is unevenly sized), it steals tasks from busy threads. Automatic load balancing. Regular ThreadPool doesn't do this.
Option 1: Batch inserts per thread, no conflicts
// Each thread writes to a DIFFERENT temporary table, merge at end:
//   Thread 1 → INSERT INTO masked_users_part1 ...
//   Thread 2 → INSERT INTO masked_users_part2 ...
//   ...
// Final:
//   INSERT INTO masked_users
//   SELECT * FROM masked_users_part1 UNION ALL ...
Option 2: Single writer thread with batching
// Multiple reader/processor threads → Channel → one writer thread
// Writer batches inserts for efficiency:
val batch = mutableListOf<Row>()
for (row in channel) {
    batch.add(row)
    if (batch.size >= 5000) {
        executeBatch(batch) // JDBC batch insert: 5000 rows at once
        batch.clear()
    }
}
if (batch.isNotEmpty()) executeBatch(batch) // don't drop the final partial batch
Option 3: COPY command (PostgreSQL fastest path)
// PostgreSQL COPY is 5-10x faster than INSERT:
val pgConn = connection.unwrap(org.postgresql.core.BaseConnection::class.java) // unwrap survives pool proxies
val copyManager = CopyManager(pgConn)
val writer = copyManager.copyIn("COPY masked_users FROM STDIN WITH CSV")
// Write rows as CSV directly to the COPY stream
// No SQL parsing, no transaction per row. Pure bulk loading.
For a data engine: COPY is the standard approach for bulk loading into PostgreSQL. MySQL has LOAD DATA INFILE (similar concept). The key insight: INSERT is optimized for OLTP (small transactions). COPY/LOAD DATA is optimized for ETL (bulk loading).
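Writing rows to the COPY stream means producing correct CSV yourself. A sketch of the escaping rules (quote fields containing delimiters or newlines, double embedded quotes; note that by default COPY ... WITH CSV reads an unquoted empty field as NULL):

```java
// Minimal CSV escaping for feeding COPY ... WITH CSV (illustrative helper).
class CsvEscape {
    // Quote a field if it contains comma, quote, or line break; double embedded quotes.
    static String field(String value) {
        if (value == null) return ""; // unquoted empty = NULL under default COPY CSV settings
        boolean needsQuoting = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        if (!needsQuoting) return value;
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Join escaped fields into one CSV line terminated by \n:
    static String row(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(field(fields[i]));
        }
        return sb.append('\n').toString();
    }
}
```

In practice you'd stream these lines straight into `writer.writeToCopy(...)` rather than building full strings per row.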
"Work deeply with relational databases, add support for new databases"
Pages: PostgreSQL stores data in 8KB pages (blocks). Each table is stored as one or more files on disk (split into 1GB segments), divided into 8KB pages. Each page contains row data (tuples) + free space + item pointers.
Tuples: each row is a tuple with a header (23 bytes: xmin/xmax transaction IDs for MVCC, null bitmap) + actual data. Rows are not sorted — new rows go into any page with free space.
TOAST: values larger than ~2KB are compressed and/or stored in a separate TOAST table. So a TEXT column with 1MB of data doesn't bloat the main table pages.
Indexes: separate files. BTree index: sorted tree of (key, pointer-to-heap-tuple). Index scan: traverse tree → find pointer → fetch tuple from heap page.
Why this matters for data masking:
When you mask a column, you're rewriting tuples. If the masked value is a different size (shorter email, longer name), the tuple size changes. PostgreSQL handles this with HOT (Heap-Only Tuple) updates when possible, but large-scale masking often means rewriting the entire table — essentially CREATE TABLE masked AS SELECT ... FROM original.
MVCC (Multi-Version Concurrency Control): readers don't block writers, writers don't block readers. Each transaction sees a consistent snapshot of the data.
How PostgreSQL implements it: each row has xmin (transaction that created it) and xmax (transaction that deleted/updated it). An UPDATE creates a NEW tuple (new xmin) and marks the old one as deleted (sets xmax). Old versions remain on disk until VACUUM removes them.
Why this matters for a data engine:
If you're reading a 10GB table while production writes are happening, MVCC guarantees you see a consistent snapshot — no dirty reads, no torn rows. But: long-running reads prevent VACUUM from cleaning old versions → table bloat. For TB-scale reads, this is a real concern. Solution: read from a replica, or use pg_dump with snapshot isolation.
MySQL difference: InnoDB uses MVCC with undo logs. Old versions stored in undo tablespace, not inline. Different performance characteristics for long-running reads.
JDBC DatabaseMetaData:
val meta = connection.metaData
// All tables:
val tables = meta.getTables(catalog, schema, "%", arrayOf("TABLE"))
while (tables.next()) {
val tableName = tables.getString("TABLE_NAME")
}
// Columns for each table:
val columns = meta.getColumns(catalog, schema, tableName, "%")
while (columns.next()) {
val colName = columns.getString("COLUMN_NAME")
val colType = columns.getString("TYPE_NAME") // varchar, int4, etc.
val nullable = columns.getString("IS_NULLABLE")
}
// Foreign keys — critical for referential integrity:
val fks = meta.getImportedKeys(catalog, schema, tableName)
while (fks.next()) {
val fkColumn = fks.getString("FKCOLUMN_NAME")
val pkTable = fks.getString("PKTABLE_NAME")
val pkColumn = fks.getString("PKCOLUMN_NAME")
// users.department_id → departments.id
}
// Primary keys, indexes, unique constraints similarly
PostgreSQL information_schema (alternative):
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
For data masking: you need the full schema graph — tables, columns, types, foreign keys, unique constraints, check constraints. Foreign keys tell you: if you mask users.email, you must also mask orders.customer_email if it references the same data. Referential integrity must be preserved.
PostgreSQL = row-oriented (OLTP): stores all columns of a row together. Fast for: SELECT * FROM users WHERE id = 123 (fetch one complete row). Slow for: SELECT AVG(age) FROM users (must read every row's entire data to extract one column).
ClickHouse = column-oriented (OLAP): stores each column separately. Fast for: SELECT AVG(age) FROM users (reads only the age column — much less I/O). Slow for: SELECT * FROM users WHERE id = 123 (must reassemble row from many column files).
ClickHouse specifics: designed for analytical queries on huge datasets (billions of rows). No OLTP-style UPDATE/DELETE — mutations (ALTER TABLE ... UPDATE/DELETE) exist but are heavyweight asynchronous part rewrites, not transactional row updates. No transactions. Massively parallel query execution. Compression per column (similar values compress well).
For the data engine: adding "support for ClickHouse" means understanding that you can't UPDATE rows in place (no masking via UPDATE). You must read data → mask → write to a new table. Also: ClickHouse has different data types, different SQL dialect, different JDBC driver behavior.
"Strong CS fundamentals: algorithms and data structures"
This is a graph problem. Tables = nodes. Foreign keys = directed edges. You need topological sort to determine processing order (parent tables first).
// Build dependency graph. Edge direction matters: to get PARENTS FIRST out of
// Kahn's algorithm, edges must point parent → child (an edge INTO a table
// means "something must be processed before you"):
data class Table(val name: String)
data class ForeignKey(val from: Table, val to: Table) // from = child, to = parent
val graph = mutableMapOf<Table, MutableList<Table>>() // adjacency list: parent → children
for (fk in allForeignKeys) {
    graph.getOrPut(fk.to) { mutableListOf() }.add(fk.from)
}
// Topological sort (Kahn's algorithm):
fun topologicalSort(allTables: List<Table>, graph: Map<Table, List<Table>>): List<Table> {
    // Count incoming edges — with parent → child edges, a table's in-degree
    // is the number of parents it depends on:
    val inDegree = allTables.associateWith { 0 }.toMutableMap()
    for (children in graph.values) {
        for (child in children) inDegree[child] = inDegree.getValue(child) + 1
    }
    // Start from tables with no parents (in-degree 0):
    val queue: Queue<Table> = ArrayDeque(allTables.filter { inDegree[it] == 0 })
    val sorted = mutableListOf<Table>()
    while (queue.isNotEmpty()) {
        val node = queue.poll()
        sorted.add(node)
        for (neighbor in graph[node].orEmpty()) {
            inDegree[neighbor] = inDegree.getValue(neighbor) - 1
            if (inDegree[neighbor] == 0) queue.add(neighbor)
        }
    }
    check(sorted.size == allTables.size) { "Cycle detected: ${allTables - sorted.toSet()}" }
    return sorted // process tables in this order
}
Why topological sort matters: if orders has FK to users, you must mask users first, then orders — so masked user IDs are consistent across tables.
Cycle detection: circular FKs are rare but possible. If topological sort doesn't include all tables → cycle exists. Handle by temporarily dropping constraints, processing, re-adding.
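The cycle check falls out of Kahn's algorithm directly: any table still holding a nonzero in-degree after the sort is part of a cycle. A compact Java sketch (table names and the map-based FK representation are illustrative):

```java
import java.util.*;

class TopoSort {
    // childToParents: child table -> the parent tables it references via FK.
    // Returns parents-first order; tables missing from the result sit on a cycle.
    static List<String> parentsFirst(List<String> tables,
                                     Map<String, List<String>> childToParents) {
        Map<String, List<String>> parentToChildren = new HashMap<>();
        Map<String, Integer> inDegree = new HashMap<>();
        for (String t : tables) inDegree.put(t, 0);
        for (var e : childToParents.entrySet()) {
            for (String parent : e.getValue()) {
                parentToChildren.computeIfAbsent(parent, k -> new ArrayList<>())
                                .add(e.getKey());
                inDegree.merge(e.getKey(), 1, Integer::sum); // child depends on parent
            }
        }
        Deque<String> queue = new ArrayDeque<>();
        for (String t : tables) if (inDegree.get(t) == 0) queue.add(t);
        List<String> sorted = new ArrayList<>();
        while (!queue.isEmpty()) {
            String t = queue.poll();
            sorted.add(t);
            for (String child : parentToChildren.getOrDefault(t, List.of())) {
                if (inDegree.merge(child, -1, Integer::sum) == 0) queue.add(child);
            }
        }
        return sorted; // if sorted.size() < tables.size(), the rest form cycles
    }
}
```

A self-referencing FK like employees.manager_id never reaches in-degree 0, so the table simply doesn't appear in the output — exactly the signal to fall back to constraint dropping or a cycle-resolution strategy.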
Requirements: each original email maps to a unique masked email. Same original → same masked (deterministic for referential integrity). Format looks realistic. Can't reverse to original.
// Approach: Format-Preserving Hash
fun maskEmail(original: String): String {
val (local, domain) = original.split("@")
// Deterministic hash — same input always gives same output:
val hash = MessageDigest.getInstance("SHA-256")
.digest("$SALT:$original".toByteArray())
// Convert to readable format:
val maskedLocal = Base36.encode(hash.take(8)) // "a7x2k9m1"
val maskedDomain = "example.com" // or hash the domain too
return "$maskedLocal@$maskedDomain"
}
// original: alex@gmail.com → masked: a7x2k9m1@example.com
// Same input always gives same output
// Different inputs give different outputs (collision probability ~1/2^64)
Performance for 100M emails: SHA-256 is ~500ns per hash. 100M * 500ns = 50 seconds. Parallelizable across 8 cores → ~6 seconds. Memory: only one email in memory at a time (streaming).
Uniqueness guarantee: for full SHA-256, collision probability across 100M items is astronomically low (~10^-53). Beware truncation, though: keeping only 8 hash bytes raises that to roughly 10^-4, and a 6-character base-36 local part (36^6 ≈ 2.2B values) makes collisions near-certain at 100M scale. If absolute uniqueness is required: hash + check against a HashSet (RAM: 100M * ~50 bytes = ~5GB), or a Bloom filter for an approximate check.
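A runnable Java version of the masking sketch above (the salt, the example.com domain, and the 8-character local part are arbitrary choices for illustration, not the product's actual transformer):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class EmailMask {
    private static final char[] ALPHABET =
            "abcdefghijklmnopqrstuvwxyz0123456789".toCharArray();

    // Deterministic and irreversible: SHA-256 over salt + original,
    // first 8 hash bytes mapped into a base-36 alphabet for a readable local part.
    static String mask(String email, String salt) {
        if (email == null) return null;
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest((salt + ":" + email).getBytes(StandardCharsets.UTF_8));
            StringBuilder local = new StringBuilder(8);
            for (int i = 0; i < 8; i++) {
                local.append(ALPHABET[(hash[i] & 0xFF) % ALPHABET.length]);
            }
            return local + "@example.com";
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always present on the JVM
        }
    }
}
```

The modulo mapping slightly biases the character distribution; for masking (as opposed to cryptography) that bias is harmless.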
Problem: production database has 500 tables, 10TB. Developer needs a 100GB subset for testing that's consistent (no broken foreign keys).
Algorithm — walk the FK graph from seed:
// 1. User selects "seed": I want 1000 users from the 'users' table
// 2. Engine walks FK graph:
// users(1000 rows)
// → orders (all orders for these 1000 users)
// → order_items (all items for these orders)
// → products (all products referenced by order_items)
// → categories (all categories for these products)
// Recursively follow all FKs
// 3. BFS/DFS traversal:
fun collectSubset(seedTable: String, seedIds: Set<Long>): Map<String, Set<Long>> {
    val collected = mutableMapOf<String, MutableSet<Long>>()
    val queue: Queue<Pair<String, Set<Long>>> = ArrayDeque()
    queue.add(seedTable to seedIds)
    while (queue.isNotEmpty()) {
        val (table, ids) = queue.poll()
        // Only expand IDs we haven't seen for this table — this both
        // terminates cycles and handles revisiting a table with NEW ids
        // (skipping an already-seen table outright would lose rows):
        val seen = collected.getOrPut(table) { mutableSetOf() }
        val newIds = ids - seen
        if (newIds.isEmpty()) continue
        seen += newIds
        // Find all tables that reference this table:
        for (fk in foreignKeysReferencingTable(table)) {
            queue.add(fk.childTable to fetchChildIds(fk, newIds))
        }
        // Find all tables this table references:
        for (fk in foreignKeysFromTable(table)) {
            queue.add(fk.parentTable to fetchReferencedIds(fk, newIds))
        }
    }
    return collected // table → set of IDs to include
}
Challenges: cycle detection (table A references B, B references A), size control (subset might grow exponentially through many-to-many relationships), performance (hundreds of queries to resolve FKs).
Know the problem space even if you haven't built it before
Data masking = replacing sensitive data with realistic but fake data. So developers/testers can work with production-like data without accessing real PII.
Types:
Static masking: create a masked COPY of the database. Original untouched, copy has fake data. Developers access the copy. Most common for testing environments.
Dynamic masking: mask data on-the-fly at query time. Same database, different views for different users. DBA sees real data, developer sees masked. More complex, used for production access control.
Masking techniques by data type:
Email:   alex.budanov@gmail.com → user_7a3f@masked.com
Phone:   +7 702 365 6813 → +7 702 XXX XXXX
Name:    Alexander Budanov → James Wilson (from name dictionary)
SSN:     123-45-6789 → 987-65-4321 (format-preserving)
Address: 123 Main St → 456 Oak Ave (from address generator)
Date:    1985-03-15 → shift by random offset (preserve age distribution)
Salary:  $150,000 → $147,832 (noise within ±10%)
Key requirements: referential integrity (same person_id in all tables maps to same masked identity), format preservation (masked phone looks like a phone), statistical distribution preservation (age distribution in masked data resembles original), deterministic (same input → same output across multiple runs).
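One of those requirements, deterministic date shifting, can be sketched as follows (the ±maxDays window and the hash-derived offset are illustrative assumptions; String.hashCode is not cryptographic, a salted SHA-256 would replace it in practice):

```java
import java.time.LocalDate;

class DateShift {
    // Shift a date by a per-person offset in [-maxDays, +maxDays].
    // The offset is derived from a stable key (e.g. person_id), so every
    // date belonging to the same person shifts by the same amount —
    // ordering and intervals between that person's dates are preserved.
    static LocalDate shift(LocalDate date, String personKey, int maxDays) {
        int span = 2 * maxDays + 1;
        int offset = Math.floorMod(personKey.hashCode(), span) - maxDays;
        return date.plusDays(offset);
    }
}
```

Because the shift is constant per person, derived facts like "admission was 10 days before discharge" survive masking, while the absolute dates do not.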
FPE encrypts data while preserving its format. A 16-digit credit card number encrypts to another 16-digit number. A 10-character name encrypts to another 10-character name.
Unlike regular encryption (AES produces random-looking bytes), FPE output looks like valid data. Applications that validate format (length, character set, check digits) still work with FPE-encrypted data.
Standards: FF1 and FF3-1 (NIST SP 800-38G). Based on AES internally but with Feistel network structure adapted for small domains.
Use case in data masking: mask a credit card number so it still passes Luhn check. Mask a phone number so it still has correct country code format. The masked data is deterministic (same key + same input = same output) enabling referential integrity across tables.
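Hand-rolling FF1/FF3-1 is out of scope, but the format constraint they preserve is easy to see concretely: here is the standard Luhn check that any masked card number must still pass (a sketch of the checksum, not of FPE itself):

```java
class Luhn {
    // Standard Luhn checksum: from the right, double every second digit,
    // subtract 9 if the doubled digit exceeds 9, sum all digits;
    // the number is valid if the sum is divisible by 10.
    static boolean isValid(String number) {
        int sum = 0;
        boolean doubleIt = false;
        for (int i = number.length() - 1; i >= 0; i--) {
            int d = number.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

An FPE-masked card number feeds straight through validators like this one, which is exactly why downstream applications keep working.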
Synthetic data = entirely generated data that has the same statistical properties as real data but contains no real information.
Approaches:
Rule-based: define rules per column. "Name = random from dictionary, Age = normal distribution(35, 10), Email = name + random domain." Simple, fast, predictable.
Statistical: analyze real data distributions and correlations, then generate new data matching those distributions. Preserves: column distributions, correlations between columns, cardinality. Doesn't preserve: individual records.
AI/ML-based: train a generative model (GAN, VAE, diffusion) on real data. Model learns patterns and generates new data. Most realistic but: risk of memorizing real records, harder to guarantee privacy, slower.
For the company's product: likely a combination. Rule-based for simple columns (names, emails), statistical for numerical data (preserving distributions), schema-aware for maintaining FK relationships and constraints.
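The rule-based approach can be sketched in a few lines (the name dictionary, the Normal(35, 10) age rule, and the derived-email rule are illustrative; the fixed seed makes runs reproducible):

```java
import java.util.Random;

class RuleBasedGenerator {
    private static final String[] FIRST_NAMES = {"James", "Maria", "Wei", "Aisha", "Ivan"};
    private final Random rng;

    RuleBasedGenerator(long seed) { this.rng = new Random(seed); } // deterministic runs

    // Name = random pick from a dictionary:
    String name() { return FIRST_NAMES[rng.nextInt(FIRST_NAMES.length)]; }

    // Age ~ Normal(35, 10), clamped to a plausible range:
    int age() {
        int a = (int) Math.round(35 + 10 * rng.nextGaussian());
        return Math.max(18, Math.min(90, a));
    }

    // Email derived from the generated name + row index — no real PII involved:
    String email(String name, int rowIndex) {
        return name.toLowerCase() + rowIndex + "@example.com";
    }
}
```

Seeding is the "Data as Code" angle: the same config plus the same seed regenerates byte-identical test data on every run.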
"Senior Software Engineer, 25 years of experience. 15 years at Alfa-Bank building trading infrastructure — high-performance systems working with large datasets and relational databases. System Architect at Moscow Exchange — designed trading system architecture, optimized latency and throughput. Amazon — distributed systems with Java and Kotlin on AWS. My core strengths: JVM performance, concurrency, deep database knowledge, and the ability to solve ambiguous engineering problems."
"Two things excite me. First, the deep technical challenge — building a core data engine that works with database internals, not just CRUD APIs. I've spent my career at the JVM-meets-database intersection: trading systems, exchange infrastructure, financial data processing. Second, the data masking problem requires understanding referential integrity, schema graphs, and data distributions at a deep level — my 15 years of banking SQL gives me that foundation."
"My primary background is C#/.NET — 15+ years. Java — about 4 years at Alfa-Bank and Amazon. Kotlin — used at Amazon. But at the senior level, the language is the easy part. JVM internals — GC tuning, concurrency, profiling — that's the same whether I write Java or Kotlin. The patterns are identical: threading, memory management, JDBC, database internals. I'm productive in Kotlin and actively deepening my skills."
"At Moscow Exchange — trading system where microsecond latency mattered. Optimized Hazelcast distributed data structures, designed WebSocket protocol for market data streaming. At Amazon — distributed systems processing events at scale using Kafka and DynamoDB. I understand the full spectrum: from low-level CAS operations and lock-free data structures to high-level patterns like producer-consumer, fork-join, and coroutines."
"15 years at Alfa-Bank working with financial databases — query optimization, indexing strategies, transaction management. At Moscow Exchange — Firebird database for trading data. I understand database internals: how PostgreSQL stores data in pages, MVCC, WAL, VACUUM. For this role — connecting to different databases, reading schemas, understanding FK relationships, processing data efficiently — this is exactly my background."
"At Moscow Exchange — processing real-time market data from all trading instruments simultaneously. At Alfa-Bank — 15 years of trading history, portfolio calculations across thousands of positions. I understand the techniques: streaming with JDBC fetchSize, batch processing, parallel chunk processing, off-heap memory for large datasets, choosing the right GC for the workload."
"As System Architect at Moscow Exchange, I regularly faced problems with no ready-made solution — designing custom protocols, optimizing specific bottlenecks, choosing between competing approaches. I'm comfortable with research-heavy work: reading database documentation, profiling to find the real bottleneck, prototyping solutions, and iterating based on measurements."
This is NOT a microservices/fintech role. Don't lead with: Kafka, Spring Boot, REST APIs, Docker/K8s, event-driven architecture. These are secondary. Lead with: JVM performance, concurrency, database internals, algorithms, large-scale data processing.
Problems that mirror real Synthesized.io core engine challenges. Write in Kotlin/Java.
ALGORITHMS GRAPH
Problem: Given a list of tables and foreign key relationships, determine the order to process tables so that parent tables are always processed before children. Detect cycles.
data class ForeignKey(
val fromTable: String, // child table
val fromColumn: String,
val toTable: String, // parent table
val toColumn: String
)
/**
* Return tables in topological order (parents first).
* Throw IllegalStateException if cycle detected.
*
* Example:
* tables = ["orders", "users", "order_items", "products"]
* fks = [
* FK(orders.user_id -> users.id),
* FK(order_items.order_id -> orders.id),
* FK(order_items.product_id -> products.id)
* ]
* Result: ["users", "products", "orders", "order_items"]
* (users and products first — no dependencies. Then orders. Then order_items.)
*/
fun topologicalSort(tables: List<String>, fks: List<ForeignKey>): List<String> {
// Your code here — Kahn's algorithm (BFS with in-degree)
TODO()
}
// Bonus: what if there IS a cycle?
// e.g. employees.manager_id -> employees.id (self-reference)
// How would Synthesized handle this? (Hint: CycleResolutionStrategy in their docs)
DATA MASKING HASHING
Problem: Implement a masking function that replaces emails with fake ones. Must be deterministic (same input = same output for referential integrity across tables) and irreversible.
import java.security.MessageDigest
/**
* Mask an email deterministically.
* Same input always produces the same output.
* Different inputs produce different outputs.
* Output looks like a valid email.
*
* Example:
* mask("alex@gmail.com") -> "a7x2k9@masked.io"
* mask("alex@gmail.com") -> "a7x2k9@masked.io" (same!)
* mask("maria@yahoo.com") -> "m3p8q1@masked.io" (different)
*
* Requirements:
* - Use SHA-256 with a salt for irreversibility
* - Local part: 6 alphanumeric chars derived from hash
* - Domain: always "masked.io"
* - Must handle null input (return null)
*/
fun maskEmail(email: String?, salt: String = "s3cret"): String? {
// Your code here
TODO()
}
// Bonus: How to ensure uniqueness for 100M emails?
// Bonus: What about format-preserving encryption (FPE)?
GRAPH BFS DATABASE CORE FEATURE
Problem: Given a database schema with FKs, extract a consistent subset starting from seed rows. All FK references must be satisfied in the subset.
data class Table(val name: String)
data class FK(val child: String, val childCol: String, val parent: String, val parentCol: String)
/**
* Given seed rows from a starting table, find ALL rows across
* ALL tables that must be included for referential integrity.
*
* Example schema:
* users(id, name, email)
* orders(id, user_id -> users.id, total)
* order_items(id, order_id -> orders.id, product_id -> products.id)
* products(id, name, price)
*
* Seed: users where id IN (1, 2)
* Result:
* users: {1, 2}
* orders: {10, 11, 12} (all orders for users 1,2)
* order_items: {100..108} (all items for those orders)
* products: {5, 7, 12} (all products referenced by those items)
*
* Algorithm: BFS from seed, following FKs in BOTH directions:
* - Forward (parent to child): users -> orders -> order_items
* - Backward (child to parent): order_items -> products
*/
fun collectSubset(
seedTable: String,
seedIds: Set<Long>,
fks: List<FK>,
fetchRelatedIds: (table: String, column: String, ids: Set<Long>) -> Set<Long>
): Map<String, Set<Long>> {
// Your code here — BFS through FK graph
TODO()
}
// Think about:
// 1. How to avoid infinite loops with circular FKs?
// 2. What if the subset grows exponentially? (100 users -> 10M order_items)
// 3. How to parallelize this for performance?
CONCURRENCY JVM PERFORMANCE
Problem: Process rows from a large table using producer-consumer pattern. One reader thread streams from DB, N worker threads transform data. Backpressure when workers are slow.
import java.sql.Connection
import java.util.concurrent.*
/**
* Read all rows from a table, transform each row, write to output.
* Requirements:
* - Single reader thread (JDBC is not thread-safe per connection)
* - N worker threads for CPU-intensive transformation
* - Backpressure: if workers are slow, reader pauses (bounded queue)
* - Batch writes for output (every 5000 rows)
* - Track progress: log every 100,000 rows processed
*
* Think about:
* - Why fetchSize matters (hint: PostgreSQL loads ALL rows by default!)
* - How to handle worker exceptions without losing data
* - Graceful shutdown (SIGTERM while processing 10M rows)
* - GC pressure: reuse objects? StringBuilder? IntArray vs List?
*/
fun processTable(
sourceConn: Connection,
targetConn: Connection,
tableName: String,
workerCount: Int = 8,
transform: (row: Map<String, Any?>) -> Map<String, Any?>
) {
// Your code here — BlockingQueue or Channel based
TODO()
}
// Variant: implement the same with Kotlin coroutines + Channel
// Compare: which approach is simpler? Which uses less memory?
DATABASE INTERNALS JDBC
Problem: Read complete schema information from a database using JDBC DatabaseMetaData. Build an in-memory model of tables, columns, types, PKs, and FKs.
import java.sql.Connection
data class Column(val name: String, val type: String, val nullable: Boolean, val size: Int)
data class PrimaryKey(val columns: List<String>)
data class ForeignKey(
val name: String,
val columns: List<String>,
val referencedTable: String,
val referencedColumns: List<String>
)
data class TableSchema(
val name: String,
val columns: List<Column>,
val primaryKey: PrimaryKey?,
val foreignKeys: List<ForeignKey>
)
/**
* Read all tables from a database schema using JDBC metadata API.
* Return a map of table_name -> TableSchema.
*
* Use: connection.metaData.getTables(), .getColumns(),
* .getPrimaryKeys(), .getImportedKeys()
*
* Handle: composite PKs (multiple columns), composite FKs,
* nullable columns, different SQL types.
*/
fun readSchema(conn: Connection, schemaName: String = "public"): Map<String, TableSchema> {
// Your code here
TODO()
}
// Think about:
// 1. Different databases return different type names (int4 vs INTEGER vs INT)
// 2. Some databases don't support getImportedKeys (need information_schema fallback)
// 3. Virtual FKs — relationships not declared in DB but known to users
DESIGN PATTERNS KOTLIN
Problem: Design a transformer registry — user configures masking/generation rules per column type (email, phone, name, date, number), engine applies the right transformer to each column.
/**
* Design a transformer system:
* - Each transformer handles one type of data (email, phone, name...)
* - User configures via YAML which transformer to use per column
* - Transformers must be deterministic (same input -> same output)
* - New transformers can be added without modifying existing code
*
* Example YAML config (Synthesized's actual format):
* tables:
* users:
* columns:
* email: { transformer: "email_masker" }
* phone: { transformer: "phone_masker", format: "+X XXX XXX XXXX" }
* name: { transformer: "name_generator", locale: "en" }
* salary: { transformer: "numeric_noise", variance: 0.1 }
* birth_date: { transformer: "date_shift", max_days: 30 }
*/
interface Transformer { // deliberately NOT sealed — plugins must add transformers without modifying existing code
    fun transform(value: Any?, config: Map<String, Any>): Any?
}
class EmailMasker(private val salt: String) : Transformer {
override fun transform(value: Any?, config: Map<String, Any>): Any? {
// implement deterministic email masking
TODO()
}
}
class TransformerRegistry {
    // Register and lookup transformers by name.
    // Apply the right transformer to each column based on config.
    fun register(name: String, transformer: Transformer): Unit = TODO()
    fun transformerFor(columnConfig: Map<String, Any>): Transformer = TODO()
}
// Think about:
// 1. How to make this extensible (new transformer types via plugins?)
// 2. Thread safety — transformers called from multiple threads
// 3. How does Synthesized handle cross-table consistency?
// (same email in users.email and orders.contact_email must mask identically)
Don't memorize solutions — understand the patterns. They might ask a variation. The key signals they're looking for:
1. Graph thinking: FK relationships form a DAG. Topological sort, cycle detection, BFS/DFS traversal. You dealt with dependency graphs in trading systems.
2. Performance awareness: fetchSize, batch writes, COPY, off-heap memory, GC pressure. Mention these proactively — it shows you've worked with large data.
3. Database internals: JDBC metadata API, differences between PostgreSQL/MySQL/Oracle type systems, MVCC, streaming vs buffered results.
4. Concurrency patterns: producer-consumer, bounded queues, thread safety, graceful shutdown. Raw Java preferred over coroutines for this role.
5. Domain knowledge: referential integrity in masking, deterministic transformations, consistent subsetting. Shows you understand the business problem.