I have a tensor: a batch of matrices with dims [10 x 6 x 52], i.e. 10 matrices of 6 x 52, row-major. I can change the batch size as I want. The data type is single-precision float. I need to normalize every matrix in the tensor by its column sums (so the sum is a vector of length 52). In other words, I need to compute a column-wise sum and then divide every row of the matrix by it. This is a pretty typical task in many areas. Currently, I am doing something like this:
//[10 x 6 x 52] - [batch x actions x cards_count]
// node.regrets is both source and target tensor; node.regrets_sum is storage
// for the per-batch column sums (length cards_count).
const size_t actionsCount = node.ActionsCount();
for (long long b = 0; b < _batch_size; b++)
{
    // Reset the column-sum accumulator for this batch entry.
    memset(node.regrets_sum.data(), 0, node.regrets_sum.size() * sizeof(float));

    // Accumulate column sums: regrets_sum += row a of matrix b.
    for (size_t a = 0; a < actionsCount; a++)
    {
        const size_t regretsOffset = (b * actionsCount + a) * cards_count;
        vsAdd(cards_count, node.regrets_sum.data(), node.regrets.data() + regretsOffset, node.regrets_sum.data());
    }

    // Normalize: divide every row elementwise by the column sums.
    for (size_t a = 0; a < actionsCount; a++)
    {
        const size_t regretsOffset = (b * actionsCount + a) * cards_count;
        vsDiv(cards_count, node.regrets.data() + regretsOffset, node.regrets_sum.data(), node.regrets.data() + regretsOffset);
    }
}
And this is the hottest spot in my app. I am pretty sure performance can be improved, because profiling shows that a gemm on this same tensor is faster than this normalization. Any ideas how to optimize this with the help of MKL and the Intel compiler? Maybe I have missed some ready-to-use routine for this case. Thank you in advance!
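
One direction I have been sketching (not benchmarked, and the `ones` and `recip` scratch buffers below are my own hypothetical additions, preallocated outside the loop): compute the column sums of each matrix with a single cblas_sgemv call against a vector of ones, take reciprocals once with vsInv, and replace the per-row vsDiv with vsMul, since multiplication is typically cheaper than division:

#include <mkl.h>

// Sketch only. Assumes: `ones` is a preallocated std::vector<float> of
// actionsCount elements, all set to 1.0f; `recip` is a scratch
// std::vector<float> of cards_count elements. Neither exists in my
// current code.
for (long long b = 0; b < _batch_size; b++)
{
    float* mat = node.regrets.data() + b * actionsCount * cards_count;

    // Column sums in one call: regrets_sum = M^T * ones, where M is the
    // actionsCount x cards_count row-major matrix of batch entry b.
    cblas_sgemv(CblasRowMajor, CblasTrans,
                (MKL_INT)actionsCount, (MKL_INT)cards_count,
                1.0f, mat, (MKL_INT)cards_count,
                ones.data(), 1,
                0.0f, node.regrets_sum.data(), 1);

    // One elementwise inversion per batch instead of one division per row.
    vsInv((MKL_INT)cards_count, node.regrets_sum.data(), recip.data());

    // Scale every row by the reciprocal column sums.
    for (size_t a = 0; a < actionsCount; a++)
        vsMul((MKL_INT)cards_count, mat + a * cards_count, recip.data(), mat + a * cards_count);
}

With actionsCount as small as 6, the gemv may not beat the vsAdd loop, but swapping vsDiv for vsInv + vsMul should help regardless, and this version also drops the per-batch memset. Does this look like the right direction, or is there a batched routine I am missing?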