Performance
This package uses the GSON Java library for comprehensive JSON parsing; however, this can have a performance impact when working with queries that return large responses. If response times are slow and bandwidth is not the issue, consider the following:
In general, only KQL queries are likely to return sufficient volumes of data for processing time to dominate the inherent communication time.
Return the minimum amount of data.
Where convenient to do so, structure queries to perform data formatting and filtering on the server side. This takes advantage of the proximity to the data and the scale of the Azure Data Explorer platform.
Ingesting from files or tables uses binary transfers, typically with optimized binary Parquet files; for larger volumes of data this is far more efficient than repeated inline ingestion of small amounts of data.
Simple JSON responses may be parsed faster using a custom row decoder; see below for more details.
Return int32s rather than int64s if precision is not a limitation.
Avoid supporting null data if the data is known not to contain null values. See: NullData for more details.
The MATLAB profiler can be a good way to visualize which portions of query processing consume the most time (see the sketch after this list).
If responses contain dynamic fields where not all returned values may be used, consider decoding them at the time of use, as needed, rather than as part of the query.
If direct access to the raw data returned in a response is required, this can be accomplished using the lower-level functions.
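As a brief illustration of the profiler approach, the following sketch wraps a query in a profiling session; the table name and query text are placeholders:
% Profile where time is spent when processing a query response
profile on
result = mathworks.adx.run('mytable | take 100000'); % hypothetical table and query
profile viewer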
Parallel Processing
If the Parallel Computing Toolbox is installed (use ver to check), it can be used to speed up the processing of KQL query rows. Use the optional useParallel argument to enable this (the default is false). Additionally, a threshold number of rows can be applied below which parallel processing will not be used; the default is 1000. The best value for this threshold will depend on the content of the rows and whether repeated large row-count calls are made, so some experimentation may be required. Parallel processing can also be used with custom row decoders. The default row decoder requires a process-based parpool (a sketch for starting one follows the example below).
Here parallel processing is enabled along with a custom row decoder; parallelism is applied for queries returning 10000 rows or more.
Example:
% Pass a handle to the example custom row decoder
h = @mathworks.internal.adx.exampleCustomRowDecoder;
% Enable parallel processing for result sets of 10000 rows or more
[result, success] = mathworks.adx.run(query, useParallel=true, parallelThreshold=10000, customRowDecoder=h);
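Because the default row decoder requires a process-based pool, one can be started explicitly if a suitable pool is not already open; a minimal sketch using standard Parallel Computing Toolbox calls:
% Start a process-based parallel pool if none is already running
if isempty(gcp("nocreate"))
    parpool("Processes");
end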
Custom row decoder
If queries result in a simple JSON response, then writing a custom decoder to extract the required data can be faster than using the default decoder.
A sample of such a decoder, /Software/MATLAB/app/system/+mathworks/+internal/+adx/exampleCustomRowDecoder.m, is provided. This function handles an array of rows of the form [128030544,1.0,"myStringValue-1"], with fields of types int64, double, and string. It will not process other data and is intended for speed rather than strict correctness or robustness.
The custom decoder is applied to the PrimaryResult rows field only.
It is required to return a cell array of size number-of-rows by number-of-columns that can be converted to a MATLAB table with the given schema.
It is not required to respect input argument flags if foreknowledge of the returned data permits it.
Custom row decoders can be applied to progressive and non-progressive KQL API v2 and KQL API v1 mode queries.
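For illustration only, a minimal decoder for the three-column row shape shown above might look like the following sketch. The function name and simplified signature are hypothetical; the exact required signature should be taken from the provided exampleCustomRowDecoder.m. Note that jsondecode returns JSON numbers as doubles, so int64 values beyond 2^53 would lose precision with this simplified approach:
function tableData = myRowDecoder(rowsJson)
    % Hypothetical decoder for rows of the form [128030544,1.0,"myStringValue-1"]
    rows = jsondecode(rowsJson); % cell array, one cell per row
    tableData = cell(numel(rows), 3);
    for n = 1:numel(rows)
        tableData{n,1} = int64(rows{n}{1});  % double -> int64, precision caveat above
        tableData{n,2} = rows{n}{2};         % double as-is
        tableData{n,3} = string(rows{n}{3}); % char -> string
    end
end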
When a custom decoder is not used, the generic decoder mathworks.internal.adx.getRowsWithSchema is used.
A function handle is used to pass the custom row decoder to the run or KQLQuery commands.
Example:
% Build a KQL query for the given table
query = sprintf('table("%s", "all")', tableName);
% Pass a handle to the custom row decoder to the run command
crd = @mathworks.internal.adx.exampleCustomRowDecoder;
[result, success] = mathworks.adx.run(query, customRowDecoder=crd);
The exampleCustomRowDecoder example implements parfor-based parallelization using the Parallel Computing Toolbox; this is not required but may be helpful.
Depending on the nature of the decoder code, a process-based pool may be required rather than a thread-based pool. It is unlikely that a decoding process would benefit from remote processing via a cluster-based pool, but this would be possible.
doNotParse parallel array processing (Experimental)
A JSON array of doNotParse values being processed by JSONMapper must be checked for paired double quotes that the GSON toString method adds to some value types. While trivial, this check can be slow if there are many elements, for example row values.
An experimental flag (useParallel) can be set to true to enable parallel processing of this step using parfor if the Parallel Computing Toolbox is available.
The property can be set in /Software/MATLAB/app/system/+adx/+control/JSONMapper.m, in the fromJSON method.
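Purely to illustrate the shape of this step (this is not the actual JSONMapper code), stripping paired quotes from many values parallelizes naturally with parfor:
% Illustrative only: strip paired double quotes added to some values
values = ["""abc""", "123", """def"""]; % some elements carry paired quotes
parfor n = 1:numel(values)
    v = values(n);
    if strlength(v) >= 2 && startsWith(v, '"') && endsWith(v, '"')
        v = extractBetween(v, 2, strlength(v) - 1);
    end
    values(n) = v;
end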
Skip parsing row array elements (skipRowsArrayDecode)
The following applies to v2 queries only.
When an adx.data.models.QueryV2ResponseRaw is generated by an adx.data.api.Query.queryRun call, prior to the generation of a MATLAB table as done by the higher-level mathworks.adx.KQLQuery function, the Rows array elements themselves are not parsed, but the Rows array is.
If the optional named argument skipRowsArrayDecode is used with an adx.data.api.Query.queryRun call, then the frames are parsed but the Rows array itself is not. This enables parsing the Rows independently, if needed, in a performance-optimal way.
Example:
% Create a request
request = adx.data.models.QueryRequest();
request.csl = "mytablename | take 100";
% Create the Query object and run the request
query = adx.data.api.Query();
[code, result, response, id] = query.queryRun(request, apiVersion="v2", skipRowsArrayDecode=true);
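The unparsed Rows JSON text in the result can then be decoded independently, only when and as far as it is needed. In the following sketch, rawRows stands in for the unparsed Rows text taken from a frame of the result; the exact frame layout should be inspected rather than assumed:
% Hypothetical: rawRows holds the unparsed Rows JSON text from a frame
rawRows = '[[128030544,1.0,"myStringValue-1"],[128030545,2.0,"myStringValue-2"]]';
rows = jsondecode(rawRows); % decode only when needed
firstValue = rows{1}{1};    % access individual elements on demand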