
Feature: Performance Optimization for Large-Scale Generation #63

@Goldziher

Description

Add performance optimizations for generating large datasets (10k+ records) to make Interface-Forge suitable for big data testing scenarios.

Problem Statement

When generating large numbers of records (10,000+), the current implementation can face memory and performance challenges:

  • Memory usage grows linearly with batch size
  • No built-in progress tracking for long operations
  • CPU-bound operations block the event loop
  • No way to process data in chunks

Proposed Features

1. Streaming/Chunking API

// Generate data in manageable chunks
const stream = factory.stream({ 
  chunkSize: 1000,
  total: 100000 
});

stream.on('data', async (chunk: T[]) => {
  // Process each chunk (e.g., bulk insert to DB)
  await db.batchInsert(chunk);
});

stream.on('end', () => {
  console.log('Generation complete');
});

stream.on('error', (error) => {
  console.error('Generation failed:', error);
});
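
The chunking behavior behind such a `stream()` method could be sketched with an async generator, which keeps only one chunk in memory at a time. This is a minimal illustration, not the proposed implementation; `build` is a placeholder for the factory's per-record builder function:

```typescript
// Hypothetical sketch: chunked generation via an async generator.
// `build` stands in for the factory's per-record builder (assumed name).
async function* generateChunks<T>(
  build: (index: number) => T,
  total: number,
  chunkSize: number,
): AsyncGenerator<T[]> {
  for (let start = 0; start < total; start += chunkSize) {
    const end = Math.min(start + chunkSize, total);
    const chunk: T[] = [];
    for (let i = start; i < end; i++) {
      chunk.push(build(i));
    }
    yield chunk; // only one chunk is alive at a time
  }
}
```

A `for await...of` loop over this generator would naturally apply backpressure, since the next chunk is not built until the consumer asks for it.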

2. Memory-Efficient Generation

  • Implement garbage collection hints between chunks
  • Option to generate and immediately persist without holding in memory
  • Lazy evaluation for large nested structures
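
The "generate and immediately persist" idea could look like the following sketch, assuming a `build` function and a `persist` callback (e.g. a bulk DB insert); both names are placeholders:

```typescript
// Hypothetical sketch: persist each chunk immediately instead of
// accumulating all records. `persist` stands in for e.g. a bulk DB insert.
async function generateAndPersist<T>(
  build: (index: number) => T,
  total: number,
  chunkSize: number,
  persist: (chunk: T[]) => Promise<void>,
): Promise<number> {
  let written = 0;
  for (let start = 0; start < total; start += chunkSize) {
    const count = Math.min(chunkSize, total - start);
    const chunk = Array.from({ length: count }, (_, i) => build(start + i));
    await persist(chunk); // the chunk becomes collectible after this await
    written += count;
  }
  return written;
}
```

Because each chunk goes out of scope once persisted, peak memory is bounded by `chunkSize` rather than `total`.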

3. Parallel Generation with Worker Threads

// defineUser is a placeholder for the factory definition function
const factory = new Factory<User>(defineUser, {
  parallel: {
    enabled: true,
    workers: 4 // Number of worker threads
  }
});

// Utilizes multiple CPU cores
const users = await factory.batchAsync(50000);
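
"Distribute work evenly across threads" could be done with a simple partitioning step that assigns each worker a near-equal index range; this is an illustrative sketch, not the proposed implementation:

```typescript
// Hypothetical sketch: split `total` records into near-equal index ranges,
// one per worker thread.
function partition(
  total: number,
  workers: number,
): Array<{ start: number; count: number }> {
  const base = Math.floor(total / workers);
  const remainder = total % workers;
  const ranges: Array<{ start: number; count: number }> = [];
  let start = 0;
  for (let w = 0; w < workers; w++) {
    // Spread any remainder over the first `remainder` workers.
    const count = base + (w < remainder ? 1 : 0);
    ranges.push({ start, count });
    start += count;
  }
  return ranges;
}
```

Each range would then be sent to a worker (e.g. via `worker_threads`), and the per-worker results concatenated in range order to keep output deterministic.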

4. Progress Callbacks

const users = await factory.batchAsync(100000, {
  onProgress: (current, total, percentage) => {
    console.log(`Generated ${current}/${total} (${percentage}%)`);
  },
  progressInterval: 1000 // Report every 1000 items
});
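
The throttled callback behavior could be sketched as below; `build` is again a placeholder for the per-record builder, and the callback fires every `progressInterval` items plus once at completion:

```typescript
// Hypothetical sketch of interval-based progress reporting.
type ProgressFn = (current: number, total: number, percentage: number) => void;

function buildWithProgress<T>(
  build: (index: number) => T,
  total: number,
  onProgress: ProgressFn,
  progressInterval = 1000,
): T[] {
  const results: T[] = [];
  for (let i = 1; i <= total; i++) {
    results.push(build(i - 1));
    // Fire the callback every `progressInterval` items and at completion,
    // so the final report always shows 100%.
    if (i % progressInterval === 0 || i === total) {
      onProgress(i, total, Math.round((i / total) * 100));
    }
  }
  return results;
}
```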

5. Benchmarking Suite

  • Add performance benchmarks to CI
  • Track generation speed over time
  • Memory usage profiling
  • Comparison with other factory libraries
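
A minimal timing helper for such benchmarks might look like this, using Node's `perf_hooks` (a sketch only; a real suite would likely use a dedicated benchmarking tool with warmup and statistical aggregation):

```typescript
// Hypothetical micro-benchmark sketch using Node's high-resolution timer.
import { performance } from "node:perf_hooks";

function benchmark(label: string, fn: () => void): number {
  const start = performance.now();
  fn();
  const elapsed = performance.now() - start;
  console.log(`${label}: ${elapsed.toFixed(1)} ms`);
  return elapsed;
}
```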

Implementation Details

  1. Streaming Implementation

    • Use Node.js streams API
    • Support backpressure handling
    • Allow custom transform streams
  2. Memory Management

    • Implement chunk-based generation
    • Clear internal caches between chunks
    • Option to disable caching for large operations
  3. Worker Thread Support

    • Serialize factory configuration to workers
    • Distribute work evenly across threads
    • Merge results efficiently
  4. Progress Tracking

    • Non-blocking progress updates
    • Configurable update frequency
    • ETA calculation
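
The ETA calculation mentioned above is straightforward to sketch: extrapolate the observed per-item time over the remaining items. All names here are illustrative:

```typescript
// Hypothetical ETA estimate from elapsed time and completed count.
function estimateEtaMs(
  startedAtMs: number,
  nowMs: number,
  done: number,
  total: number,
): number {
  if (done === 0) return Infinity; // no data yet
  const perItemMs = (nowMs - startedAtMs) / done;
  return perItemMs * (total - done);
}
```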

Performance Goals

  • Generate 1M simple records in < 30 seconds
  • Memory usage should plateau (not grow linearly)
  • Support concurrent generation without blocking
  • Maintain type safety throughout

Example Use Cases

// Database seeding
UserFactory.stream({ chunkSize: 5000 })
  .pipe(new DatabaseWriter(db))
  .on('finish', () => console.log('Database seeded'));

// CSV export
const csvStream = ProductFactory.stream({ chunkSize: 1000 })
  .pipe(new CSVTransform())
  .pipe(fs.createWriteStream('products.csv'));

// Real-time generation API
app.get('/generate/:count', async (req, res) => {
  res.setHeader('Content-Type', 'application/x-ndjson');
  
  factory.stream({ 
    chunkSize: 100, 
    total: Number(req.params.count) // route params are strings
  })
  .pipe(new JSONLinesTransform())
  .pipe(res);
});
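
The `JSONLinesTransform` used in the API example above is not defined in this issue; one plausible sketch, built on Node's standard `Transform` stream, serializes each incoming chunk of objects to newline-delimited JSON:

```typescript
import { Transform, TransformCallback } from "node:stream";

// Hypothetical sketch of a JSONLinesTransform: accepts arrays of objects
// (object mode on the writable side) and emits NDJSON text.
class JSONLinesTransform extends Transform {
  constructor() {
    super({ writableObjectMode: true });
  }

  _transform(chunk: unknown[], _enc: string, cb: TransformCallback) {
    try {
      this.push(chunk.map((o) => JSON.stringify(o)).join("\n") + "\n");
      cb();
    } catch (err) {
      cb(err as Error);
    }
  }
}
```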

Testing Requirements

  • Benchmark tests for various data sizes
  • Memory leak tests
  • Worker thread stability tests
  • Stream backpressure handling tests
  • Progress accuracy tests
