Interface SIMDCompositeExpression
- All Superinterfaces:
FastCompositeExpression, Savable, Serializable
- All Known Implementing Classes:
BulkTurboEvaluator.BatchedVectorCompositeExpression
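Interface for a SIMD-aligned vector expression evaluator.
Exposes a dual-path API: a convenience path that accepts 2D arrays, and a flat-array path for callers who manage their own memory layout.
It targets large-scale parallel time-series calculation, machine learning transformations, and fused token attention operations.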
- Author:
GBEMIRO
-
Method Summary
Modifier and Type, Method, Description
- void applyBulk(double[][] variables, double[] output, boolean useBlocks)
Executes the compiled expression over a convenient 2D array of variables.
- void applyBulk(double[] flatVariables, double[] output, boolean useBlocks)
Warp Speed Path (Power Users): Evaluates a single, raw, pre-grouped flat array at maximum hardware throughput with zero memory copies, zero allocations, and direct sequential prefetching.
- void applyBulkBatched(double[][] variables, double[] output, int batchSize, boolean useBlocks)
Evaluates a 2D variable structure using an explicit, cache-bounded window chunk size to enforce strict CPU L1/L2 data cache localization.
- void applyBulkBatched(double[] flatVariables, double[] output, int batchSize, boolean useBlocks)
Warp Speed Path (Power Users): Evaluates a pre-grouped flat array using custom-defined batch chunk boundaries to maximize custom hardware L1/L2 execution efficiency.
- void applyBulkParallel(double[][] variables, double[] output)
Distributes the evaluation of a 2D array dataset evenly across multiple processing threads using a fork-join block chunking methodology.
- void applyBulkParallel(double[] flatVariables, double[] output)
Warp Speed Path (Power Users): Concurrently evaluates a pre-grouped flat array by cleanly dividing segments across active CPU cores for optimal multi-threaded memory bandwidth consumption.
- void applyMatrixKernel(FlatMatrix[] inputs, FlatMatrix output, String operation)
Fuses deep learning and high-performance neural network transformations directly over pre-allocated double-precision structural tensor types.
- void applyMatrixKernel(FlatMatrixF[] inputs, FlatMatrixF output, String op)
Fuses deep learning and high-performance neural network transformations directly over pre-allocated single-precision (float) tensor execution spaces.
Methods inherited from interface com.github.gbenroscience.parser.turbo.tools.FastCompositeExpression
apply, applyMatrix, applyScalar, applyString, applyVector, checkErrorLogs, getCompiler, getRoot
-
Method Details
-
applyBulk
void applyBulk(double[][] variables, double[] output, boolean useBlocks)
Executes the compiled expression over a convenient 2D array of variables. Under the hood, this method flattens and groups the data into sequential memory blocks before delegating to the core vectorized execution kernel.
- Parameters:
variables - A 2D array of variable channels where each outer row holds the entire vector for one variable slot, e.g., [[x1, x2... xn], [y1, y2... yn], [z1, z2... zn]].
output - The pre-allocated target array into which the evaluated results are written directly.
useBlocks - If true, processes data in L1/L2 cache-bounded blocks to prevent memory thrashing on large datasets. If false, runs a plain sequential scalar loop, which favors raw throughput on small datasets.
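The "flattens and groups" step described above can be pictured with a small plain-Java sketch. It does not call the library, and the class and method names (GroupedFlattenSketch, flattenGrouped) are hypothetical; it only shows one way to pack per-variable rows back-to-back, assuming every row has the same length.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of the library: packs variable rows into one grouped flat array. */
final class GroupedFlattenSketch {

    /** Copies rows [x...], [y...], ... back-to-back into [x1..xn, y1..yn, ...]. */
    static double[] flattenGrouped(double[][] variables) {
        int slots = variables.length;
        int n = slots == 0 ? 0 : variables[0].length;
        double[] flat = new double[slots * n];
        for (int v = 0; v < slots; v++) {
            if (variables[v].length != n) {
                throw new IllegalArgumentException(
                        "row " + v + " has length " + variables[v].length + ", expected " + n);
            }
            // Slot v occupies the contiguous range [v * n, (v + 1) * n).
            System.arraycopy(variables[v], 0, flat, v * n, n);
        }
        return flat;
    }

    public static void main(String[] args) {
        double[][] vars = { {1, 2, 3}, {10, 20, 30} };
        // prints [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
        System.out.println(Arrays.toString(flattenGrouped(vars)));
    }
}
```

Whether the real implementation copies in exactly this way is an assumption; the point is only the resulting grouped order.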
-
applyBulkParallel
void applyBulkParallel(double[][] variables, double[] output)
Distributes the evaluation of a 2D array dataset evenly across multiple processing threads using fork-join block chunking.
- Parameters:
variables - A 2D array of variable channels where each outer row holds the entire vector for one variable slot, e.g., [[x1, x2... xn], [y1, y2... yn]].
output - The pre-allocated target array into which the parallel results are written.
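As a rough illustration of fork-join block chunking, the sketch below splits an index range recursively until each piece is small enough to evaluate in a simple loop. It uses only java.util.concurrent and does not call the library. The class name, the THRESHOLD value, and the stand-in expression x + 2*y are all assumptions made for the example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

/** Hypothetical sketch of fork-join chunking over a 2D variable array. */
final class ForkJoinChunkSketch {

    /** Assumed leaf size; a real implementation would tune this. */
    static final int THRESHOLD = 1024;

    /** Stand-in for a compiled expression over two variables. */
    static double eval(double x, double y) {
        return x + 2.0 * y;
    }

    static final class Chunk extends RecursiveAction {
        private final double[][] vars;
        private final double[] out;
        private final int from;
        private final int to;

        Chunk(double[][] vars, double[] out, int from, int to) {
            this.vars = vars;
            this.out = out;
            this.from = from;
            this.to = to;
        }

        @Override
        protected void compute() {
            if (to - from <= THRESHOLD) {
                // Leaf: each task writes a disjoint slice of the output.
                for (int i = from; i < to; i++) {
                    out[i] = eval(vars[0][i], vars[1][i]);
                }
                return;
            }
            int mid = (from + to) >>> 1;
            invokeAll(new Chunk(vars, out, from, mid), new Chunk(vars, out, mid, to));
        }
    }

    static void evaluateParallel(double[][] vars, double[] out) {
        if (vars.length != 2 || vars[0].length != out.length || vars[1].length != out.length) {
            throw new IllegalArgumentException("expected two variable rows matching the output length");
        }
        ForkJoinPool.commonPool().invoke(new Chunk(vars, out, 0, out.length));
    }
}
```

Because every leaf writes to its own index range, no synchronization on the output array is needed in this sketch.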
-
applyBulkBatched
void applyBulkBatched(double[][] variables, double[] output, int batchSize, boolean useBlocks)
Evaluates a 2D variable structure in windows of an explicit, caller-chosen size so that each window's working set can stay within the CPU L1/L2 data caches.
- Parameters:
variables - A 2D array of variable channels where each outer row holds the entire vector for one variable slot, e.g., [[x1, x2... xn], [y1, y2... yn]].
output - The pre-allocated target array into which the evaluated blocks are written.
batchSize - The number of elements in each window processed per loop pass.
useBlocks - If true, overlays internal sub-tiling on top of the batch boundaries. If false, evaluates each batch sequentially in a single pass.
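A batch window can be thought of as an outer loop that advances in steps of batchSize, with the last window shortened to fit. The following plain-Java sketch shows that loop shape without calling the library. The class name, the returned batch count, and the stand-in expression x * y are hypothetical and exist only for illustration.

```java
/** Hypothetical sketch of evaluating a 2D variable array in fixed-size windows. */
final class BatchWindowSketch {

    /** Stand-in for a compiled expression over two variables. */
    static double eval(double x, double y) {
        return x * y;
    }

    /** Evaluates window by window and returns how many windows were processed. */
    static int evaluateBatched(double[][] vars, double[] out, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int n = out.length;
        int batches = 0;
        int start = 0;
        while (start < n) {
            // Clamp the final window; written this way to avoid int overflow.
            int end = (n - start <= batchSize) ? n : start + batchSize;
            for (int i = start; i < end; i++) {
                out[i] = eval(vars[0][i], vars[1][i]);
            }
            start = end;
            batches++;
        }
        return batches;
    }
}
```

With 10 elements and a batch size of 4, this loop runs three windows of lengths 4, 4, and 2.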
-
applyBulk
void applyBulk(double[] flatVariables, double[] output, boolean useBlocks)
Warp Speed Path (Power Users): Evaluates a single, raw, pre-grouped flat array at maximum hardware throughput, with no memory copies, no allocations, and an access pattern that suits sequential prefetching.
- Parameters:
flatVariables - A single flat array containing variables grouped contiguously, back-to-back, by variable slot. CRITICAL: Data must use a Grouped structure, e.g., [x1, x2... xn, y1, y2... yn, z1, z2... zn]. Interleaved arrays will yield corrupted results.
output - The pre-allocated target array into which the evaluated vector stream is written directly.
useBlocks - If true, maps memory segments into cache-sized micro-tiles. If false, lets the CPU prefetcher step sequentially through the flat segments without tiling.
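Data that arrives interleaved, such as [x1, y1, x2, y2, ...], has to be regrouped before it matches the Grouped structure required above. The sketch below shows one way to do that in plain Java. The class and method names are hypothetical and the helper is not part of the library.

```java
/** Hypothetical helper: converts an interleaved array into the grouped layout. */
final class InterleavedToGroupedSketch {

    /**
     * Rearranges [x1, y1, ..., x2, y2, ...] into [x1..xn, y1..yn, ...].
     *
     * @param interleaved input with `slots` values per element
     * @param slots       number of variable slots
     */
    static double[] toGrouped(double[] interleaved, int slots) {
        if (slots <= 0 || interleaved.length % slots != 0) {
            throw new IllegalArgumentException("length must be a positive multiple of slots");
        }
        int n = interleaved.length / slots;
        double[] grouped = new double[interleaved.length];
        for (int i = 0; i < n; i++) {
            for (int v = 0; v < slots; v++) {
                // Element i of slot v moves from index i * slots + v to v * n + i.
                grouped[v * n + i] = interleaved[i * slots + v];
            }
        }
        return grouped;
    }
}
```

For example, toGrouped(new double[] {1, 10, 2, 20, 3, 30}, 2) yields {1, 2, 3, 10, 20, 30}. The copy costs one extra allocation, so it is best done once, outside any hot loop.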
-
applyBulkParallel
void applyBulkParallel(double[] flatVariables, double[] output)
Warp Speed Path (Power Users): Concurrently evaluates a pre-grouped flat array by dividing it into segments across the active CPU cores, so that multiple threads share the available memory bandwidth.
- Parameters:
flatVariables - A single flat array containing variables grouped contiguously, back-to-back, by variable slot. CRITICAL: Data must use a Grouped structure, e.g., [x1, x2... xn, y1, y2... yn, z1, z2... zn]. Do NOT pass interleaved data.
output - The pre-allocated target array into which the parallel workers write their results.
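One simple way to divide a grouped flat array across cores is to give each worker a contiguous range of element indices and let it read the matching positions in every variable slot. The sketch below does this with a parallel IntStream. It does not call the library, and the class name, the two-slot assumption, and the stand-in expression x - y are hypothetical.

```java
import java.util.stream.IntStream;

/** Hypothetical sketch: one contiguous segment per worker over a grouped flat array. */
final class FlatSegmentParallelSketch {

    /** Stand-in for a compiled expression over two variables. */
    static double eval(double x, double y) {
        return x - y;
    }

    /** Assumes flat = [x1..xn, y1..yn] and out.length == n. */
    static void evaluateParallel(double[] flat, double[] out, int workers) {
        int n = out.length;
        if (flat.length != 2 * n) {
            throw new IllegalArgumentException("flat array must hold two slots of length " + n);
        }
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        IntStream.range(0, workers).parallel().forEach(w -> {
            // Segment bounds computed in long arithmetic to avoid overflow.
            int from = (int) ((long) n * w / workers);
            int to = (int) ((long) n * (w + 1) / workers);
            for (int i = from; i < to; i++) {
                out[i] = eval(flat[i], flat[n + i]); // x at i, y at n + i
            }
        });
    }
}
```

A caller might pass Runtime.getRuntime().availableProcessors() as the worker count; how the library itself chooses its segment count is not stated in this page.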
-
applyBulkBatched
void applyBulkBatched(double[] flatVariables, double[] output, int batchSize, boolean useBlocks)
Warp Speed Path (Power Users): Evaluates a pre-grouped flat array using caller-defined batch boundaries, so that the window size can be tuned to the L1/L2 caches of the target hardware.
- Parameters:
flatVariables - A single flat array containing variables grouped contiguously, back-to-back, by variable slot. CRITICAL: Data must use a Grouped structure, e.g., [x1, x2... xn, y1, y2... yn, z1, z2... zn]. Do NOT pass interleaved data.
output - The pre-allocated target array into which the evaluated blocks are written.
batchSize - The number of elements in each processing window.
useBlocks - If true, enforces inner tiling within each batch. If false, trusts the user-provided batch size as the definitive sequential block length.
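The relationship between batchSize and useBlocks can be read as two nested loops: an outer loop over batches and, when tiling is on, an inner loop over smaller tiles inside each batch. The sketch below is one possible reading of that description, not the library's implementation. The TILE constant, the class name, and the stand-in expression x + y are assumptions.

```java
/** Hypothetical sketch: batches with optional inner tiling over a grouped flat array. */
final class FlatBatchTilingSketch {

    /** Assumed tile length in elements; a real value would depend on cache size. */
    static final int TILE = 256;

    /** Stand-in for a compiled expression over two variables. */
    static double eval(double x, double y) {
        return x + y;
    }

    /** Assumes flat = [x1..xn, y1..yn] and out.length == n. */
    static void evaluate(double[] flat, double[] out, int batchSize, boolean useBlocks) {
        int n = out.length;
        if (flat.length != 2 * n || batchSize <= 0) {
            throw new IllegalArgumentException("bad flat length or batchSize");
        }
        // Without tiling, the whole batch is walked as a single tile.
        int step = useBlocks ? Math.min(TILE, batchSize) : batchSize;
        int batchStart = 0;
        while (batchStart < n) {
            int batchEnd = (n - batchStart <= batchSize) ? n : batchStart + batchSize;
            int tileStart = batchStart;
            while (tileStart < batchEnd) {
                int tileEnd = (batchEnd - tileStart <= step) ? batchEnd : tileStart + step;
                for (int i = tileStart; i < tileEnd; i++) {
                    out[i] = eval(flat[i], flat[n + i]);
                }
                tileStart = tileEnd;
            }
            batchStart = batchEnd;
        }
    }
}
```

Both settings of useBlocks produce the same values in this sketch; only the traversal order within a batch differs.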
-
applyMatrixKernel
void applyMatrixKernel(FlatMatrix[] inputs, FlatMatrix output, String operation)
Applies fused deep learning and neural network transformations directly over pre-allocated double-precision matrices.
- Parameters:
inputs - An ordered array of input matrices, e.g., weights, targets, scales, or vector biases.
output - The destination matrix, initialized by the caller, that receives the transformed values.
operation - The identifier of the kernel to run (e.g., "matmul", "rms_norm", "swiglu", "q8_quantize").
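For orientation, the sketch below shows what a conventional matrix product looks like over flat, row-major double arrays. It does not use FlatMatrix, whose API is not described on this page, and it is only an assumption that the "matmul" identifier corresponds to this textbook definition. The class and method names are hypothetical.

```java
/** Hypothetical reference sketch: textbook matrix product on row-major flat arrays. */
final class RowMajorMatmulSketch {

    /**
     * Computes C = A * B, where A is m x k and B is k x n, all stored row-major.
     */
    static double[] matmul(double[] a, int m, int k, double[] b, int n) {
        if (a.length != m * k || b.length != k * n) {
            throw new IllegalArgumentException("array lengths do not match the given shapes");
        }
        double[] c = new double[m * n];
        for (int i = 0; i < m; i++) {
            for (int p = 0; p < k; p++) {
                double aip = a[i * k + p];
                // The innermost loop walks rows of B and C contiguously.
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aip * b[p * n + j];
                }
            }
        }
        return c;
    }
}
```

The i-p-j loop order is chosen so that the innermost loop touches contiguous memory, which is the usual reason flat row-major storage is preferred for this kind of kernel.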
-
applyMatrixKernel
void applyMatrixKernel(FlatMatrixF[] inputs, FlatMatrixF output, String op)
Applies fused deep learning and neural network transformations directly over pre-allocated single-precision (float) matrices.
- Parameters:
inputs - An ordered array of float input matrices, intended for transformer workloads.
output - The destination single-precision matrix that receives the result.
op - The identifier of the kernel to run (e.g., "matmul_bias_gelu", "rope_split", "mha_attention").
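To illustrate what "fused" means for an identifier such as "matmul_bias_gelu", the sketch below computes the matrix product, adds the bias, and applies GELU in a single pass over each output element, with no intermediate matrices. It is a plain-Java illustration on float arrays, not the library's kernel. The class name, the argument order, and the use of the common tanh approximation of GELU are all assumptions.

```java
/** Hypothetical sketch of a fused matmul + bias + GELU pass on row-major float arrays. */
final class FusedMatmulBiasGeluSketch {

    /** Tanh approximation of GELU, computed in double and narrowed to float. */
    static float gelu(float x) {
        double inner = Math.sqrt(2.0 / Math.PI) * (x + 0.044715 * x * x * x);
        return (float) (0.5 * x * (1.0 + Math.tanh(inner)));
    }

    /**
     * out[i][j] = gelu(sum over p of a[i][p] * w[p][j], plus bias[j]),
     * where a is m x k, w is k x n, and bias has length n.
     */
    static float[] apply(float[] a, int m, int k, float[] w, int n, float[] bias) {
        if (a.length != m * k || w.length != k * n || bias.length != n) {
            throw new IllegalArgumentException("array lengths do not match the given shapes");
        }
        float[] out = new float[m * n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                float acc = bias[j];
                for (int p = 0; p < k; p++) {
                    acc += a[i * k + p] * w[p * n + j];
                }
                // The activation is applied before the value is ever stored.
                out[i * n + j] = gelu(acc);
            }
        }
        return out;
    }
}
```

Fusing avoids writing the pre-activation matrix to memory and reading it back, which is the usual motivation for kernels of this kind.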
-