Database statistics describe and give information about the data within a database. When the data is retrieved from the database via a query, the query optimizer generates an execution plan for the query based on the information that is given in the database statistics. There are different types of object statistics (number of rows of the table, number of blocks of the table, number of distinct values of a column, histograms, and so on). Therefore, the statistics should be fresh and should describe the data correctly, such that the query optimizer will have enough good information to produce the desired optimal execution plan. But keeping statistics fresh and of high quality is not easy. This is because our data is increasing rapidly, and a database usually gathers statistics via jobs at night. Also, gathering statistics for very large tables requires lots of resources (CPU, I/O, and time). Over time, the process of statistics gathering, the information kept in the statistics, and the methods that describe our data have all been improved.
In computer science, finding distinct elements in a data stream has been studied for a long time and is known as the “count-distinct” problem (https://en.wikipedia.org/wiki/Count-distinct_problem). Oracle has implemented the suggested algorithms to compute the Number of Distinct Values. In this article, we will see how Oracle database (11g, 12c r2) gathers important object statistics: the Number of Distinct Values of the column.
(Editor's note: This article contains a few mathematical symbols; for best legibility it is attached as a PDF.)