这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@KungFuJesus
Copy link
Contributor

@KungFuJesus KungFuJesus commented Nov 29, 2024

For AVX2 specifically, our chunkmemset call chain allows for safely dealing with half sized chunks. Because this doesn't alter the code size in any meaningful way, we may as well use macros to define the constant we use for comparison in the "safe" variety of functions.

Summary by CodeRabbit

  • New Features

    • Introduced a new macro MIN_CHUNKSIZE to standardize minimum chunk sizes across various architectures.
    • Updated memory operation functions to utilize the new MIN_CHUNKSIZE for improved safety and consistency.
  • Bug Fixes

    • Enhanced error handling in memory operations to prevent exceeding buffer limits.
  • Documentation

    • Minor adjustments to comments for clarity regarding memory operations.

For AVX2 specifically, our chunkmemset call chain allows for safely
dealing with half sized chunks. Because this doesn't alter the code
size in any meaningful way, we may as well use macros to define the
constant we use for comparison in the "safe" variety of functions.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 29, 2024

Walkthrough

The changes in this pull request include the introduction of a new macro definition, MIN_CHUNKSIZE, in multiple architecture-specific files, establishing a minimum chunk size based on the size of chunk_t or halfchunk_t, depending on the architecture. The macro is incorporated into existing functions related to memory operations, enhancing the robustness of chunk size handling without altering existing function signatures or control flows. Additionally, the chunkset_tpl.h file has been updated to replace direct size checks with the new macro, refining the logic for memory operations.

Changes

File Path Change Summary
arch/arm/chunkset_neon.c Added macro: #define MIN_CHUNKSIZE sizeof(chunk_t)
arch/generic/chunkset_c.c Added macro: #define MIN_CHUNKSIZE sizeof(chunk_t)
arch/power/chunkset_power8.c Added macro: #define MIN_CHUNKSIZE sizeof(chunk_t)
arch/riscv/chunkset_rvv.c Added macro: #define MIN_CHUNKSIZE CHUNK_SIZE; updated CHUNK_MEMSET_RVV_IMPL to use it.
arch/x86/chunkset_avx2.c Added macro: #define MIN_CHUNKSIZE sizeof(halfchunk_t)
arch/x86/chunkset_sse2.c Added macro: #define MIN_CHUNKSIZE sizeof(chunk_t)
arch/x86/chunkset_ssse3.c Added macro: #define MIN_CHUNKSIZE sizeof(chunk_t)
chunkset_tpl.h Updated CHUNKMEMSET_SAFE and CHUNKCOPY_SAFE to use MIN_CHUNKSIZE instead of sizeof(chunk_t); refined logic for memory operations.

Possibly related PRs

Suggested labels

optimization

Suggested reviewers

  • Dead2

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@KungFuJesus KungFuJesus changed the title Allow more cases to fall through to CHUNKMEMSET/CHUNKCOPY Draft: Allow more cases to fall through to CHUNKMEMSET/CHUNKCOPY Nov 29, 2024
@KungFuJesus KungFuJesus marked this pull request as draft November 29, 2024 21:46
@KungFuJesus KungFuJesus changed the title Draft: Allow more cases to fall through to CHUNKMEMSET/CHUNKCOPY Allow more cases to fall through to CHUNKMEMSET/CHUNKCOPY Nov 29, 2024
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (6)
arch/generic/chunkset_c.c (1)

13-13: Consider adding a brief comment explaining the macro's purpose.

While the macro's name is self-explanatory, a brief comment would help document its role in the safety checks for chunk operations.

+/* Minimum size for safe chunk operations */
 #define MIN_CHUNKSIZE sizeof(chunk_t)
arch/x86/chunkset_sse2.c (1)

15-15: Consider documenting the performance implications.

The introduction of MIN_CHUNKSIZE allows for more consistent handling of chunk sizes across different architectures while potentially improving performance by allowing more cases to fall through to the optimized SIMD implementations. Consider:

  1. Adding a comment explaining the rationale behind the minimum chunk size choice
  2. Documenting any performance implications in the relevant documentation
  3. Adding benchmarks to measure the impact of this change

Would you like me to help draft the documentation or create benchmark templates?

Also applies to: 44-44

arch/power/chunkset_power8.c (1)

16-16: Architectural improvement for vectorized operations

This change enables more efficient vectorized operations while maintaining safety:

  1. Using sizeof(chunk_t) automatically adapts to the architecture's vector size
  2. Allows the compiler to optimize size checks better
  3. Maintains consistency with the VSX implementation's capabilities

Consider documenting these architectural assumptions and requirements in a comment above the macro, especially noting:

  • The relationship with VSX vector size
  • The connection to CHUNK_SIZE
  • The usage in safety checks
+/* MIN_CHUNKSIZE defines the minimum safe chunk size for memory operations.
+ * For POWER8 VSX, this matches the vector size (16 bytes) and CHUNK_SIZE.
+ * This is used in safety checks for chunkmemset operations.
+ */
#define MIN_CHUNKSIZE sizeof(chunk_t)
arch/riscv/chunkset_rvv.c (2)

Line range hint 89-89: Enhance the assertion message for better debugging.

While the assertion correctly checks for non-zero length, the error message could be more descriptive to aid in debugging.

-    Assert(len > 0, "chunkcopy should never have a length 0");
+    Assert(len > 0, "CHUNKCOPY: Invalid zero length parameter provided");

Line range hint 71-86: Consider benchmarking the memory copying strategies.

The implementation intelligently uses different strategies for overlapping and non-overlapping cases, leveraging glibc's RVV-optimized memcpy through IFUNC. While this approach is theoretically sound, it would be valuable to verify the performance characteristics:

  1. Non-overlapping case: Single large memcpy
  2. Overlapping case: N-chunk memcpy followed by chunk-by-chunk copying

Consider adding performance benchmarks to validate these strategies across different memory sizes and alignment scenarios.

chunkset_tpl.h (1)

Line range hint 253-285: Performance consideration: Branch prediction hints are properly used

The use of UNLIKELY macro for the size checks helps the compiler optimize the branch prediction, minimizing the performance impact on the common path. This is especially important given the performance-critical nature of these functions, as noted in the previous learnings.

The additional safety checks align well with the PR's objective of controlling chunk operation dispatch while maintaining good performance characteristics.

Consider adding comments explaining the performance implications of these checks for future maintainers.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5456966 and cc181a2.

📒 Files selected for processing (8)
  • arch/arm/chunkset_neon.c (1 hunks)
  • arch/generic/chunkset_c.c (1 hunks)
  • arch/power/chunkset_power8.c (1 hunks)
  • arch/riscv/chunkset_rvv.c (1 hunks)
  • arch/x86/chunkset_avx2.c (1 hunks)
  • arch/x86/chunkset_sse2.c (1 hunks)
  • arch/x86/chunkset_ssse3.c (1 hunks)
  • chunkset_tpl.h (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • arch/x86/chunkset_avx2.c
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-11-12T09:04:30.407Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
🔇 Additional comments (8)
arch/generic/chunkset_c.c (1)

13-13: LGTM! Verify consistent usage across architectures.

The MIN_CHUNKSIZE macro is correctly defined based on sizeof(chunk_t) (8 bytes), which aligns with the generic implementation's use of uint64_t. This provides a safe minimum boundary for chunk operations.

Let's verify the consistency of MIN_CHUNKSIZE definitions across different architectures:

✅ Verification successful

MIN_CHUNKSIZE definitions are architecture-aware and correct

The verification shows that MIN_CHUNKSIZE is consistently defined according to each architecture's requirements:

  • Generic C implementation: sizeof(chunk_t) where chunk_t is uint64_t (8 bytes)
  • x86 SSE2/SSSE3: sizeof(chunk_t) where chunk_t is __m128i (16 bytes)
  • x86 AVX2: sizeof(halfchunk_t) where halfchunk_t is __m128i (16 bytes)
  • ARM NEON: sizeof(chunk_t) where chunk_t is uint8x16_t (16 bytes)
  • POWER8: sizeof(chunk_t) where chunk_t is vector unsigned char (16 bytes)
  • RISC-V: CHUNK_SIZE which is defined as 32 bytes

Each architecture correctly defines MIN_CHUNKSIZE based on its vector width and implementation requirements, maintaining proper alignment and size constraints.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check MIN_CHUNKSIZE definitions across architecture-specific implementations
# Expected: All implementations should define MIN_CHUNKSIZE based on their chunk_t size

# Search for MIN_CHUNKSIZE definitions in architecture-specific files
echo "Checking MIN_CHUNKSIZE definitions:"
rg "define\s+MIN_CHUNKSIZE" arch/

Length of output: 563


Script:

#!/bin/bash
# Let's check the chunk_t and halfchunk_t definitions in these files
echo "=== Checking chunk_t and related type definitions ==="
rg "typedef.*chunk_t" arch/ -B 2 -A 2

echo -e "\n=== Checking CHUNK_SIZE definitions ==="
rg "define.*CHUNK_SIZE" arch/ -B 2 -A 2

echo -e "\n=== Checking halfchunk_t definitions ==="
rg "typedef.*halfchunk_t" arch/ -B 2 -A 2

Length of output: 3547

arch/x86/chunkset_sse2.c (1)

Line range hint 44-44: Verify the usage of MIN_CHUNKSIZE in the template.

The inclusion of chunkset_tpl.h at this point is crucial as it will use the newly defined MIN_CHUNKSIZE macro.

Let's verify how MIN_CHUNKSIZE is used in the template:

✅ Verification successful

Let me gather more context about the usage of MIN_CHUNKSIZE in relation to CHUNKCOPY since that's what's being defined in the reviewed code.


MIN_CHUNKSIZE is correctly defined and used in the template

The verification shows that:

  • MIN_CHUNKSIZE is properly defined in chunkset_sse2.c as sizeof(chunk_t)
  • The template uses MIN_CHUNKSIZE appropriately for safety checks:
    • Before small copies to prevent underflow
    • In safe variants to ensure sufficient buffer size
  • CHUNKCOPY (which is being defined to use SSE2) is properly integrated with these safety checks
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of MIN_CHUNKSIZE in the template file
# Expected: The template should use MIN_CHUNKSIZE for size comparisons

# Check how MIN_CHUNKSIZE is used in the template
echo "Checking MIN_CHUNKSIZE usage in template:"
rg -A 3 "MIN_CHUNKSIZE" chunkset_tpl.h

# Look for any size comparisons in the safe variants
echo -e "\nChecking size comparisons in safe variants:"
ast-grep --pattern 'CHUNKMEMSET_SAFE' chunkset_tpl.h

Length of output: 661


Script:

#!/bin/bash
# Check for CHUNKCOPY usage in the template
echo "Checking CHUNKCOPY usage in template:"
rg -A 5 "CHUNKCOPY" chunkset_tpl.h

# Check MIN_CHUNKSIZE definition in the SSE2 file
echo -e "\nChecking MIN_CHUNKSIZE definition:"
rg "MIN_CHUNKSIZE" arch/x86/chunkset_sse2.c

Length of output: 1899

arch/power/chunkset_power8.c (1)

16-16: LGTM! Verify consistency with template usage.

The MIN_CHUNKSIZE definition correctly uses sizeof(chunk_t) which aligns with the VSX vector size on POWER8 and matches the CHUNK_SIZE definition. This enables efficient vectorized operations.

Let's verify the consistency across architectures and template usage:

✅ Verification successful

Let me gather more information about the template usage since the previous command failed to locate the template file.


MIN_CHUNKSIZE definition is correct and consistent with template usage

The verification shows that:

  • The MIN_CHUNKSIZE definition in arch/power/chunkset_power8.c correctly matches the template usage in chunkset_tpl.h where it's used for safety checks before memory operations
  • It's consistent with other architecture implementations:
    • Most architectures define it as sizeof(chunk_t) matching their vector types
    • AVX2 uses sizeof(halfchunk_t) due to its wider vectors
    • RISC-V uses explicit CHUNK_SIZE (32)
  • The Power8 implementation's CHUNK_SIZE of 16 bytes aligns with its chunk_t (vector unsigned char) size
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify MIN_CHUNKSIZE definitions across architectures and its usage in templates

# Check MIN_CHUNKSIZE definitions across all architecture files
echo "Checking MIN_CHUNKSIZE definitions:"
rg "MIN_CHUNKSIZE" arch/

# Verify usage in template implementation
echo -e "\nChecking template usage:"
rg "MIN_CHUNKSIZE" "*tpl.h"

# Verify chunk_t definitions to ensure size consistency
echo -e "\nChecking chunk_t definitions:"
rg "typedef.*chunk_t" arch/

Length of output: 1282


Script:

#!/bin/bash
# Find the template file first
echo "Locating template files:"
fd ".*tpl\.h"

# Check MIN_CHUNKSIZE usage in chunkset implementation
echo -e "\nChecking MIN_CHUNKSIZE usage in implementation:"
rg -A 3 "MIN_CHUNKSIZE" --type c

# Check if there are any other size-related checks in chunkset files
echo -e "\nChecking other size-related patterns:"
rg "CHUNK_SIZE|chunk_size" --type c arch/

Length of output: 3004

arch/x86/chunkset_ssse3.c (1)

17-17: LGTM! Consider adding a brief comment explaining the macro's purpose.

The MIN_CHUNKSIZE macro is correctly defined based on the SSSE3 vector size. This aligns with the PR's objective of standardizing chunk size handling across architectures.

Consider adding a brief comment explaining that this defines the minimum safe chunk size for SSSE3 operations:

+/* Minimum chunk size for safe SSSE3 vector operations */
#define MIN_CHUNKSIZE sizeof(chunk_t)

Let's verify how this constant is used in the template file:

✅ Verification successful

Let me gather more context about how these MIN_CHUNKSIZE checks are used in the template file.


The MIN_CHUNKSIZE macro is correctly used as a safety threshold

The verification confirms that MIN_CHUNKSIZE is properly used in the template file for safety checks:

  • Used as a threshold to fall back to byte-by-byte operations when remaining length is too small
  • Ensures safe handling of small chunks in both chunkcopy and chunkmemset operations
  • Correctly guards against potential buffer overruns when vector operations cannot be safely performed

The original review comment's suggestion to add documentation remains valid, but the technical implementation and usage are sound.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of MIN_CHUNKSIZE in the template implementation
# Expected: Should find usage in safety checks within chunkmemset/chunkcopy functions

# Search for MIN_CHUNKSIZE usage in template file
rg "MIN_CHUNKSIZE" "chunkset_tpl.h"

Length of output: 149


Script:

#!/bin/bash
# Get surrounding context of MIN_CHUNKSIZE usage
rg -B 5 -A 5 "MIN_CHUNKSIZE" "chunkset_tpl.h"

Length of output: 545

arch/arm/chunkset_neon.c (2)

16-16: LGTM! Macro definition aligns with NEON vector size.

The MIN_CHUNKSIZE definition as sizeof(chunk_t) is correct for NEON operations, as it maps to a 16-byte vector size which is optimal for NEON SIMD instructions.


16-16: Verify consistency with other architectures.

Let's ensure this definition maintains consistency with other architecture-specific implementations.

✅ Verification successful

MIN_CHUNKSIZE definition is consistent across architectures

The verification shows that MIN_CHUNKSIZE is consistently defined relative to the architecture-specific chunk sizes:

  • Most architectures (ARM NEON, x86 SSE2/SSSE3, POWER8, Generic C): MIN_CHUNKSIZE = sizeof(chunk_t)
  • x86 AVX2: MIN_CHUNKSIZE = sizeof(halfchunk_t) (justified as AVX2 uses 256-bit vectors with halfchunk support)
  • RISC-V: MIN_CHUNKSIZE = CHUNK_SIZE (uses a different but equivalent approach)

Each architecture appropriately defines its chunk_t type based on its vector capabilities:

  • ARM NEON: uint8x16_t (128-bit)
  • x86 SSE2/SSSE3: __m128i (128-bit)
  • x86 AVX2/AVX512: __m256i (256-bit)
  • POWER8: vector unsigned char
  • Generic C: uint64_t
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check MIN_CHUNKSIZE definitions across different architectures
# Expected: All architecture-specific implementations should define MIN_CHUNKSIZE 
# consistently relative to their vector sizes

echo "Checking MIN_CHUNKSIZE definitions across architectures..."
rg "MIN_CHUNKSIZE" arch/*/chunk*.{c,h} -B 2 -A 2

echo -e "\nChecking chunk_t definitions..."
rg "typedef.*chunk_t" arch/*/chunk*.{c,h} -B 1 -A 1

Length of output: 3647

chunkset_tpl.h (2)

Line range hint 253-259: LGTM: Safe fallback for small buffer sizes

The check for MIN_CHUNKSIZE provides a safe fallback to byte-by-byte copy when the remaining buffer space is too small for chunk operations. This aligns with the PR objective of controlling chunk operation dispatch while maintaining safety.

Let's verify the definition of MIN_CHUNKSIZE across architectures:

✅ Verification successful

Let me verify the definition of chunk_t and halfchunk_t to ensure the MIN_CHUNKSIZE values are consistent with the architecture-specific requirements.


MIN_CHUNKSIZE check is correctly defined and used across architectures

The MIN_CHUNKSIZE check in the fallback logic is correctly implemented, with architecture-specific definitions that match their respective chunk_t sizes:

  • x86 SSE2/SSSE3: 16 bytes (size of __m128i)
  • x86 AVX2: 16 bytes (size of halfchunk_t/__m128i)
  • ARM NEON: 16 bytes (size of uint8x16_t)
  • Generic: 8 bytes (size of uint64_t)
  • POWER8: 16 bytes (size of vector)
  • RISC-V: 32 bytes (explicitly defined)

The fallback to byte-by-byte copy when buffer size is less than MIN_CHUNKSIZE is a safe and correct approach that maintains compatibility across all supported architectures.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check MIN_CHUNKSIZE definitions across architecture-specific files
# Expect: MIN_CHUNKSIZE should be defined consistently based on chunk_t or halfchunk_t

rg -A 2 "define.*MIN_CHUNKSIZE" arch/

Length of output: 1287


Script:

#!/bin/bash
# Check chunk_t and halfchunk_t definitions
rg -A 2 "typedef.*chunk_t|typedef.*halfchunk_t" --type c --type h

# Check CHUNK_SIZE definition for RISC-V
rg "define.*CHUNK_SIZE" arch/riscv/

Length of output: 1294


Line range hint 279-285: LGTM: Comprehensive safety checks for copy operations

The dual check against MIN_CHUNKSIZE for both from_dist and safelen ensures safe operation by falling back to byte-by-byte copy when either:

  1. The source and destination are too close (from_dist < MIN_CHUNKSIZE)
  2. The safe buffer space is too small (safelen < MIN_CHUNKSIZE)

This maintains safety while allowing more cases to fall through to CHUNKCOPY when conditions permit.

Let's verify the relationship between these checks and the chunk operations:

✅ Verification successful

Based on the search results, I can now verify the relationship between the safety checks and chunk operations. The codebase shows:

  1. The MIN_CHUNKSIZE check is consistently defined and used across different architectures (SSE2, AVX2, AVX512, NEON, Power8, RVV) to ensure safe chunk operations.

  2. The dual check against MIN_CHUNKSIZE for both from_dist and safelen is used in the safe copy operations (CHUNKCOPY_SAFE) to prevent buffer overruns by falling back to byte-by-byte copy when:

    • The source and destination are too close (from_dist < MIN_CHUNKSIZE)
    • The safe buffer space is too small (safelen < MIN_CHUNKSIZE)
  3. This safety mechanism is consistently implemented across all architecture-specific implementations, with each having its own optimized chunk size but maintaining the same safety checks.

LGTM: The safety checks are correctly implemented and essential for safe chunk operations

The dual check against MIN_CHUNKSIZE is a critical safety mechanism that ensures:

  • Safe fallback to byte-by-byte copy when buffers are too close
  • Prevention of buffer overruns by checking available space
  • Consistent safety across all architecture-specific implementations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential chunk size assumptions in the codebase
# Expect: No unsafe assumptions about minimum distances or buffer sizes

# Look for related size checks and chunk operations
rg -A 5 "chunk_t|halfchunk_t|CHUNKCOPY|CHUNKMEMSET"

# Look for potential safety-related comments or assertions
rg -A 2 "safe|assert.*chunk"

Length of output: 70984

#define HAVE_CHUNKMEMSET_2
#define HAVE_CHUNKMEMSET_4
#define HAVE_CHUNKMEMSET_8
#define MIN_CHUNKSIZE sizeof(chunk_t)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

The MIN_CHUNKSIZE definition is inconsistent with AVX2/AVX512 implementations

The verification revealed that while MIN_CHUNKSIZE is consistently defined as sizeof(chunk_t) across architectures, the underlying chunk_t type varies:

  • SSE2/SSSE3: __m128i (16 bytes)
  • AVX2/AVX512: __m256i (32 bytes)
  • NEON: uint8x16_t (16 bytes)
  • POWER8: vector unsigned char (16 bytes)
  • Generic: uint64_t (8 bytes)

This inconsistency in chunk_t sizes means the SSE2 implementation's MIN_CHUNKSIZE (16 bytes) differs from AVX2/AVX512 (32 bytes), which could lead to suboptimal performance or incorrect behavior when switching between implementations.

  • Consider aligning MIN_CHUNKSIZE with the largest chunk size (32 bytes) across all x86 implementations
  • Review if the template code in chunkset_tpl.h correctly handles these size differences
🔗 Analysis chain

LGTM! The MIN_CHUNKSIZE definition is correct and well-placed.

The macro definition correctly uses sizeof(chunk_t) which evaluates to 16 bytes (size of __m128i). This aligns with SSE2's natural vector size and matches the PR's objective of standardizing chunk size handling across architectures.

Let's verify the consistency of MIN_CHUNKSIZE across other architecture files:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify MIN_CHUNKSIZE definition consistency across architecture files
# Expected: All architecture files should define MIN_CHUNKSIZE similarly

# Search for MIN_CHUNKSIZE definitions in architecture files
echo "Checking MIN_CHUNKSIZE definitions:"
rg "MIN_CHUNKSIZE.*sizeof\(chunk_t\)" arch/

# Verify chunk_t definitions to ensure correct sizes
echo -e "\nChecking chunk_t definitions:"
rg "typedef.*chunk_t" arch/

Length of output: 1037

#define HAVE_CHUNKMEMSET_2
#define HAVE_CHUNKMEMSET_4
#define HAVE_CHUNKMEMSET_8
#define MIN_CHUNKSIZE CHUNK_SIZE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Inconsistent chunk size definition needs attention

The RISC-V implementation defines CHUNK_SIZE as 32 and MIN_CHUNKSIZE as CHUNK_SIZE, but this differs from other architectures where:

  • Generic C: 8 bytes (uint64_t)
  • x86 (SSE2/SSSE3): 16 bytes (__m128i)
  • x86 (AVX2): 32 bytes (__m256i)
  • ARM NEON: 16 bytes (uint8x16_t)
  • POWER8: 16 bytes (vector unsigned char)

The RISC-V implementation uses a struct with 4 uint64_t (32 bytes) because "We don't have a 32-byte datatype for RISC-V arch". While the size is correctly matched to the struct, this inconsistency with other architectures' MIN_CHUNKSIZE definitions (which use sizeof(chunk_t)) should be reviewed.

🔗 Analysis chain

LGTM! Verify chunk size consistency.

The MIN_CHUNKSIZE macro definition correctly matches the CHUNK_SIZE of 32 bytes, which aligns with the chunk_t structure size (4 × uint64_t).

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify consistent chunk size definitions across architecture-specific files
rg -g 'arch/*/*.c' -g 'arch/*/*.h' "MIN_CHUNKSIZE|CHUNK_SIZE" 

Length of output: 901


Script:

#!/bin/bash
# Let's verify the chunk_t definitions to understand the size variations
rg -g 'arch/*/*.h' -g 'arch/*/*.c' "typedef.*chunk_t" -A 3

Length of output: 1436


Script:

#!/bin/bash
# Let's check the RISC-V specific chunk_t definition and structure
rg -g 'arch/riscv/*' "chunk_t|struct chunk" -B 2 -A 4

Length of output: 3430

@KungFuJesus
Copy link
Contributor Author

Hmm, actually I see one case where this may not be true...I could remedy it but the cure might be worse than the disease.

@codecov
Copy link

codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 32.18%. Comparing base (0ed5ac8) to head (cc181a2).
Report is 40 commits behind head on develop.

Files with missing lines Patch % Lines
chunkset_tpl.h 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #1819   +/-   ##
========================================
  Coverage    32.18%   32.18%           
========================================
  Files           67       67           
  Lines         5752     5752           
  Branches      1237     1237           
========================================
  Hits          1851     1851           
  Misses        3644     3644           
  Partials       257      257           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Dead2 Dead2 added the Rebase needed Please do a 'git rebase develop yourbranch' label Dec 23, 2024
@Dead2
Copy link
Member

Dead2 commented Jan 26, 2025

Develop

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%      0.102/0.105/0.107/0.001        0.028/0.033/0.035/0.002        8,526,745
 2     43.871%      0.200/0.203/0.205/0.001        0.031/0.034/0.035/0.001        6,903,702
 3     42.390%      0.238/0.243/0.245/0.002        0.028/0.031/0.033/0.002        6,670,664
 4     41.644%      0.269/0.273/0.276/0.002        0.028/0.031/0.032/0.001        6,553,205
 5     41.215%      0.295/0.300/0.303/0.002        0.026/0.031/0.032/0.001        6,485,659
 6     41.032%      0.341/0.348/0.352/0.003        0.025/0.030/0.032/0.002        6,456,912
 7     40.778%      0.499/0.507/0.512/0.003        0.030/0.031/0.032/0.001        6,416,941
 8     40.704%      0.616/0.624/0.628/0.004        0.027/0.031/0.033/0.002        6,405,249
 9     40.409%      0.898/0.907/0.912/0.003        0.026/0.030/0.032/0.002        6,358,951

 avg1  42.914%                        0.390                          0.031
 tot                                105.317                          8.446       60,778,028

   text    data     bss     dec     hex filename
 135432    1344       8  136784   21650 libz-ng.so.2

          5,354.54 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.02% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               145      page-faults:u                    #   27.080 /sec                        ( +-  0.39% )
    18,474,830,469      cycles:u                         #    3.450 GHz                         ( +-  0.02% )  (62.50%)
    32,175,991,605      instructions:u                   #    1.74  insn per cycle              ( +-  0.01% )  (75.01%)
     3,978,430,674      branches:u                       #  743.002 M/sec                       ( +-  0.01% )  (74.99%)
       173,811,583      branch-misses:u                  #    4.37% of all branches             ( +-  0.02% )  (74.99%)
     6,643,476,692      L1-dcache-loads:u                #    1.241 G/sec                       ( +-  0.01% )  (74.99%)
       232,274,443      L1-dcache-load-misses:u          #    3.50% of all L1-dcache accesses   ( +-  0.05% )  (75.01%)
        18,313,927      LLC-loads:u                      #    3.420 M/sec                       ( +-  0.81% )  (50.02%)
               145      LLC-load-misses:u                #    0.00% of all LL-cache accesses    ( +- 30.38% )  (50.00%)

#1819 Allow more cases to fall through to CHUNKMEMSET/CHUNKCOPY

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%      0.101/0.105/0.107/0.001        0.027/0.033/0.035/0.002        8,526,745
 2     43.871%      0.199/0.203/0.205/0.002        0.031/0.033/0.034/0.001        6,903,702
 3     42.390%      0.238/0.242/0.244/0.002        0.025/0.031/0.033/0.002        6,670,664
 4     41.644%      0.268/0.272/0.275/0.002        0.028/0.031/0.033/0.001        6,553,205
 5     41.215%      0.291/0.298/0.301/0.002        0.028/0.031/0.033/0.001        6,485,659
 6     41.032%      0.340/0.347/0.351/0.003        0.025/0.030/0.033/0.002        6,456,912
 7     40.778%      0.498/0.506/0.510/0.003        0.029/0.031/0.032/0.001        6,416,941
 8     40.704%      0.609/0.620/0.625/0.004        0.028/0.030/0.032/0.001        6,405,249
 9     40.409%      0.898/0.906/0.911/0.004        0.022/0.030/0.032/0.002        6,358,951

 avg1  42.914%                        0.389                          0.031
 tot                                105.017                          8.428       60,778,028

   text    data     bss     dec     hex filename
 135432    1344       8  136784   21650 libz-ng.so.2

          5,341.35 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.02% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               145      page-faults:u                    #   27.147 /sec                        ( +-  0.31% )
    18,447,534,193      cycles:u                         #    3.454 GHz                         ( +-  0.01% )  (62.48%)
    32,175,474,860      instructions:u                   #    1.74  insn per cycle              ( +-  0.01% )  (74.98%)
     3,978,256,810      branches:u                       #  744.803 M/sec                       ( +-  0.01% )  (74.98%)
       173,535,731      branch-misses:u                  #    4.36% of all branches             ( +-  0.02% )  (75.00%)
     6,643,604,513      L1-dcache-loads:u                #    1.244 G/sec                       ( +-  0.01% )  (75.02%)
       233,118,837      L1-dcache-load-misses:u          #    3.51% of all L1-dcache accesses   ( +-  0.05% )  (75.02%)
        16,988,827      LLC-loads:u                      #    3.181 M/sec                       ( +-  1.24% )  (50.00%)
               162      LLC-load-misses:u                #    0.00% of all LL-cache accesses    ( +- 25.11% )  (49.98%)

It is not much, but it seems to be slightly faster in my AVX2 tests at least.

@KungFuJesus
Copy link
Contributor Author

For this to handle all the possible corner cases it'd be a bit messy (way messier than it is with AVX512 where a simple mask can be used unconditionally). It's probably not worth the bug surface area to do this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimization Rebase needed Please do a 'git rebase develop yourbranch'

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants