Fixing Knex's GROUP BY Parameter Duplication Issue

While #2806 is a documentation issue, I do believe we should try and solve the problem properly.

Thing is, when using the same Raw expression in both SELECT and GROUP BY clauses, Knex creates separate parameter bindings, causing PostgreSQL to treat them as different expressions:

```javascript
const field = knex.raw('??#>>?', ['attributes', ['job']]);
knex('users').select(field).groupBy(field);

// Generates:
// SELECT "attributes"#>>$1 FROM users GROUP BY "attributes"#>>$2
// Parameters: [['job'], ['job']]
```

PostgreSQL requires exact expression matching between SELECT and GROUP BY, so it throws: `column must appear in the GROUP BY clause or be used in an aggregate function`.

What can be done?

## Possible Solutions

### 1. Parameter Deduplication System (Most Elegant)

Implement a system that tracks and reuses parameter positions for identical values:

```javascript
class ParameterManager {
  constructor() {
    this.params = [];
    this.paramMap = new Map(); // Map of value -> position
  }
  
  addParameter(value) {
    const key = JSON.stringify(value); // or a hash
    if (this.paramMap.has(key)) {
      return this.paramMap.get(key); // Return existing position
    }
    this.params.push(value);
    const position = this.params.length;
    this.paramMap.set(key, position);
    return position;
  }
}
```

**Result:** `SELECT "attributes"#>>$1 FROM users GROUP BY "attributes"#>>$1`

**Pros:**
- Maintains security with parameterized queries
- Fixes the issue transparently
- Reduces redundant parameters (performance benefit)

**Cons:**
- Breaking change if anyone relies on parameter order <--- This is the biggest problem
- Complex handling for objects/arrays
- Needs careful implementation to avoid hash collisions

### 2. Raw Instance Tracking (Pragmatic)

Track when the same Raw instance is used multiple times and compile it only once:

```javascript
class QueryCompiler {
  constructor() {
    this.compiledRaws = new WeakMap();
  }
  
  compileRaw(rawInstance) {
    if (this.compiledRaws.has(rawInstance)) {
      return this.compiledRaws.get(rawInstance);
    }
    const compiled = /* compile the raw */;
    this.compiledRaws.set(rawInstance, compiled);
    return compiled;
  }
}
```

**Pros:**
- Only affects identical Raw instances
- Backwards compatible
- Simpler to implement

**Cons:**
- Only works if users reuse the same instance
- Doesn't help if users create separate but identical Raw expressions
- Limited scope of fix

### 3. Expression Registry with Automatic Aliases (Most Robust)

Automatically generate and track aliases for complex expressions:

```javascript
// Knex internally generates:
// SELECT: "attributes"#>>$1 AS __expr_hash_abc123
// GROUP BY: __expr_hash_abc123
```

**Pros:**
- Always works regardless of how expressions are created
- No parameter duplication issues
- PostgreSQL handles it correctly

**Cons:**
- Changes output column names (unless carefully managed)
- More complex SQL generation
- May affect existing code expecting specific column names

### 4. Dialect-Specific GROUP BY Optimization (PostgreSQL-specific)

For PostgreSQL, detect when GROUP BY contains expressions from SELECT and use positional references or aliases:

```javascript
// Option A: Positional
SELECT "attributes"#>>$1 AS job FROM users GROUP BY 1

// Option B: Alias reference  
SELECT "attributes"#>>$1 AS job FROM users GROUP BY job
```

**Pros:**
- Solves the PostgreSQL issue directly
- Clean SQL output
- No parameter duplication

**Cons:**
- PostgreSQL-specific solution
- Positional GROUP BY can be fragile with query modifications
- Requires detecting which expressions in GROUP BY match SELECT

### 5. New API Methods (Explicit Control)

Add new methods that give users explicit control over expression reuse:

```javascript
// Method 1: groupByAlias - reference a column alias
knex('users')
  .select(knex.raw('??#>>?', ['attributes', ['job']]).as('job'))
  .groupByAlias('job')  // New method!

// Method 2: expression objects that track their usage
const expr = knex.expression('??#>>?', ['attributes', ['job']]);
knex('users')
  .select(expr.as('job'))
  .groupBy(expr)  // Knex knows to reuse the same parameter positions

// Method 3: explicit parameter sharing
knex('users')
  .select(knex.raw('??#>>?', ['attributes', ['job']]).as('job'))
  .groupBy(knex.raw.ref('job'))  // Reference the alias
```

**Pros:**
- Backwards compatible (new opt-in features)
- Explicit and predictable
- Can ship in a minor version
- Gives users immediate workarounds

**Cons:**
- Requires users to change their code
- Doesn't fix existing code automatically
- Multiple ways to do the same thing (API complexity)

We could combine #1 and #5

### Short Term (Immediate Relief)

Implement **Solution #5 (New API Methods)** first:
- Ship in next minor version (e.g., 3.2.0)
- Provides immediate workarounds for affected users
- Fully backwards compatible
- Clear migration path

Example implementation:
```javascript
// New methods available immediately
knex('users')
  .select(knex.raw('??#>>?', ['attributes', ['job']]).as('job_field'))
  .groupByAlias('job_field')  // New!

// Or with expression tracking
const jobExpr = knex.expression('??#>>?', ['attributes', ['job']]);
knex('users')
  .select(jobExpr.as('job'))
  .groupBy(jobExpr)  // Reuses same parameters
```

### Long Term (Proper Fix)

Implement **Solution #1 (Parameter Deduplication)** with careful rollout:
1. **Version 3.3.0:** Add behind feature flag
   ```javascript
   knex({ 
     deduplicateParameters: true  // Opt-in
   })
   ```

2. **Version 3.4.0:** Enable by default with opt-out
   ```javascript
   knex({ 
     deduplicateParameters: false  // Opt-out for compatibility
   })
   ```

3. **Version 4.0.0:** Remove flag, always deduplicate

## Why This Combination?

1. **Immediate Relief:** Users suffering from this issue get solutions now
2. **Proper Fix:** The underlying problem gets solved correctly
3. **Migration Path:** Clear upgrade path with feature flags
4. **Backwards Compatibility:** No breaking changes until major version
5. **Best Practices:** Maintains parameterized queries throughout

## Implementation Priority

1. **First:** Add `groupByAlias()` method (easiest, most useful)
2. **Second:** Add `knex.expression()` for trackable expressions
3. **Third:** Implement parameter deduplication behind flag
4. **Finally:** Make deduplication default in next major version

## Alternative for Users (Available Now)

Until these fixes are implemented, users can work around the issue:

```javascript
// Workaround: Use aliases
knex('users')
  .select(knex.raw('??#>>? as job_field', ['attributes', ['job']]))
  .groupBy('job_field')

// This works because GROUP BY references the alias, not the expression
```

## Conclusion

The parameter duplication issue represents a fundamental tension between:
- PostgreSQL's strict expression matching requirements
- Security best practices (parameterized queries)
- Query builder architecture (independent clause compilation)

The recommended dual approach provides both immediate relief and a long-term solution, maintaining security while fixing the underlying architectural issue.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fixing Knex's GROUP BY Parameter Duplication Issue #6263

Possible Solutions

1. Parameter Deduplication System (Most Elegant)

2. Raw Instance Tracking (Pragmatic)

3. Expression Registry with Automatic Aliases (Most Robust)

4. Dialect-Specific GROUP BY Optimization (PostgreSQL-specific)

5. New API Methods (Explicit Control)

Short Term (Immediate Relief)

Long Term (Proper Fix)

Why This Combination?

Implementation Priority

Alternative for Users (Available Now)

Conclusion

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Fixing Knex's GROUP BY Parameter Duplication Issue #6263

Description

Possible Solutions

1. Parameter Deduplication System (Most Elegant)

2. Raw Instance Tracking (Pragmatic)

3. Expression Registry with Automatic Aliases (Most Robust)

4. Dialect-Specific GROUP BY Optimization (PostgreSQL-specific)

5. New API Methods (Explicit Control)

Short Term (Immediate Relief)

Long Term (Proper Fix)

Why This Combination?

Implementation Priority

Alternative for Users (Available Now)

Conclusion

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions