We investigated and found why adding an index on the grouping field above harms performance. This is the case of the query engine currently being unable to select an optimal strategy out of multiple competing strategies.
When we run SELECT ... WHERE ... GROUP ... on a table where both WHERE and GROUP can be optimized but using different indexes, the query engine has to choose whether to (a) do WHERE using the index it hits, reading only a subset of records from the table but losing the ordering from the second index, then do GROUP on the resulting records re-creating the ordering (the second index is not used), or (b) do GROUP using the ordering from the second index, and filter records using WHERE when reading them (the first index is not used). That's a simplified picture, but it is close enough.
Which strategy is going to be faster depends on how selective WHERE is. If it throws away a lot of the records, (a) wins. If it does not throw away enough records, (b) wins. Without statistics, we cannot know whether WHERE is selective enough, so we currently end up being conservative and select (b), because that's safest.
A simple example.
Create a new MAP file. Open a command window. Run the following:
CREATE TABLE t (
v INT32, INDEX v_x BTREEDUP (v),
g INT32, INDEX g_x BTREEDUP (g)
INSERT INTO t (v, g)
SELECT [value] MOD 10, [value] MOD 100 FROM CALL ValueSequence(1, 100000, 1);
This creates a test table with 100,000 records, 2 fields, a BTREEDUP index on each field.
Run these queries one by one and observe processing times in the log (I put processing times from my tests into the text):
SELECT g, Sum(v) INTO u1 FROM t WHERE v = 1
GROUP BY g; -- 0.263
SELECT g, Sum(v) INTO u2 FROM t WHERE v IN (1, 2, 3, 4, 5, 6, 7, 8)
GROUP BY g; -- 0.354
SELECT Sum(v) INTO u3 FROM t WHERE v = 1
GROUP BY g+1; -- 0.104
SELECT Sum(v) INTO u4 FROM t WHERE v IN (1, 2, 3, 4, 5, 6, 7, 8)
GROUP BY g+1; -- 0.930
All queries are SELECT ... INTO to force reading the whole table.
The first query allows using indexes in both WHERE and GROUP, the query engine chooses to use the index in GROUP, WHERE selects 10% of all records and the query completes in 0.263 sec. It would make more sense to use the index in WHERE because that would reduce the number of processed records so much, but the query engine does not know that WHERE is this good and goes the safest route.
The second query also allows using indexes in both WHERE and GROUP, the query engine again chooses to use the index in GROUP, WHERE selects 80% of all records and the query completes in 0.354 sec. You can see that while the number of records passed by WHERE increased a lot, the processing time didn't grow much. Using the index in GROUP is the right strategy here.
The third query groups by an expression and so disallows using an index in GROUP. The query engine now turns to use the index in WHERE, WHERE selects 10% of all records (great) and the query completes in 0.104 sec. This is noticeably faster than the first query. Using the index in WHERE is the right strategy for this query and it pays off.
The fourth query also groups by an expression and disallows using an index in GROUP. However, WHERE is not selective enough and keeps most of the records (80%), so not being able to use an index in GROUP costs significant time. The query completes in 0.930 sec, way slower than query 2. Using the index in WHERE is the wrong strategy here.
In sum, adding an index can sometimes make a query slower. We aren't alone in this, it happens with other databases as well. If you care about performance, then after you make any changes to a table, you are supposed to re-test all queries that end up using that table to both see whether they still work and also to see how they now perform, that's not news. However, we agree we can help here. In general, we want to do two things. First, let the query engine make better use of field statistics so that it can make better judgement when making choices. Second, allow seeing what the query engine decided to do. Finally, we already have ample means to influence the decisions of the query engine with temporary tables and temporary databases, but we might add a couple more constructs / functions for that as well.