Sorting algorithms used in the CL implementations
Which sorting algorithm should one implement when developing a program? The best answer is probably none. Use the sort provided by your system or library. Unless you know your input data has some special properties that you can take advantage of, the provided sort should be enough for your needs and is probably implemented more efficiently.
However, I think it is important to know which sorting algorithm is implemented. If one knows the properties of the data, it is possible to tell whether the provided sort can or will pose a problem. In the same way that programmers shouldn't implement a sorting algorithm every time they need to sort something, they should also be aware of the limitations and advantages of the system sort. That way one can decide whether a special sort is needed or not.
Common Lisp provides the functions sort and stable-sort. The HyperSpec describes their operation well but it does not define the sorting algorithm. That decision is left to the implementations. In addition, the two functions do not necessarily share the same algorithm. The difference between the two is that the second function sorts in a way that guarantees stability, i.e., two elements that compare as equal keep the same relative order after sorting is completed. The use of sort and stable-sort requires some care (see the section sort pitfalls) but let's focus on the algorithms and not on their usage.
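To make the stability guarantee concrete, here is a minimal sketch in Python (used here only for illustration, not Lisp). The helpers selection_sort and is_stable are hypothetical names invented for this example and do not come from any CL implementation. The records are sorted by their numeric key only, so records with the same key count as equal.

```python
# Records are (key, label); only the key takes part in the comparison.
records = [(2, "a"), (1, "b"), (2, "c"), (1, "d")]


def selection_sort(items, key):
    """Textbook selection sort; its long-distance swaps make it unstable."""
    items = list(items)
    for i in range(len(items)):
        # Index of the first smallest remaining element.
        smallest = min(range(i, len(items)), key=lambda j: key(items[j]))
        items[i], items[smallest] = items[smallest], items[i]
    return items


def is_stable(original, result, key):
    """True if elements with equal keys keep their original relative order."""
    for k in {key(x) for x in original}:
        before = [x for x in original if key(x) == k]
        after = [x for x in result if key(x) == k]
        if before != after:
            return False
    return True


def by_key(record):
    return record[0]


stable_result = sorted(records, key=by_key)            # Python's sorted is stable
unstable_result = selection_sort(records, key=by_key)  # may reorder equal keys

print(stable_result, is_stable(records, stable_result, by_key))
print(unstable_result, is_stable(records, unstable_result, by_key))
```

In this run the stable result keeps (2, "a") ahead of (2, "c"), while the selection sort swaps them. That reordering is exactly what stable-sort promises not to do and what sort is allowed to do.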
What sorting algorithms do the major open source CL implementations actually implement? I was curious about it and went to check the source for ABCL, CCL, CLISP, CMUCL, ECL and SBCL. Not surprisingly, we find some differences between the implementations. What was more unexpected to discover is that some implementations also use different sorting algorithms according to the sequence type. A quick survey of the findings is summarized in the following table (if anything is incorrect, please tell me). The links to the source code are in the implementation names (careful, in CCL and SBCL there are two links).
| Implementation | sort | stable-sort |
|---|---|---|
| ABCL | merge sort (lists) / quicksort | merge sort |
| CCL | merge sort (lists) / quicksort | merge sort |
| CLISP | tree sort | tree sort |
| ECL | merge sort (lists) / quicksort | quicksort (strings + bit vectors) / merge sort |
| SBCL | merge sort (lists) / heapsort | merge sort |
In terms of the implementation of sort, quicksort is the most used algorithm, followed by heapsort. The choice of these algorithms is expected. Both have an average-case performance of O(n lg n) and heapsort guarantees a worst-case performance of O(n lg n) too. Quicksort has a worst-case performance of O(n^2) but it can be optimized in several ways so that it also gives an expected worst-case performance of O(n lg n). However, it seems that the quicksort implementations are not completely optimized. In ECL (and ABCL) quicksort implements a partition scheme which deals better with duplicate elements (although it is not three-way partitioning) but it always picks the first element as the pivot. CCL chooses the pivot with a median-of-3 method and always sorts the smaller partition first to ensure a worst-case stack depth of O(lg n).
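The two quicksort optimizations mentioned for CCL can be sketched as follows. This is an illustrative Python version written for this post, not a transcription of the CCL source. The Lomuto partition scheme and the function names median_of_three, partition and quicksort are assumptions chosen for brevity.

```python
def median_of_three(a, lo, hi):
    """Return the index of the median of a[lo], a[mid] and a[hi]."""
    mid = (lo + hi) // 2
    candidates = sorted([(a[lo], lo), (a[mid], mid), (a[hi], hi)])
    return candidates[1][1]


def partition(a, lo, hi):
    """Lomuto partition around a median-of-3 pivot; returns the pivot's final index."""
    p = median_of_three(a, lo, hi)
    a[p], a[hi] = a[hi], a[p]          # park the pivot at the end
    pivot = a[hi]
    store = lo
    for i in range(lo, hi):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi] = a[hi], a[store]  # move the pivot into its final place
    return store


def quicksort(a):
    """In-place quicksort that only recurses into the smaller partition."""
    def sort_range(lo, hi):
        while lo < hi:
            p = partition(a, lo, hi)
            if p - lo < hi - p:
                sort_range(lo, p - 1)  # smaller left part: recurse
                lo = p + 1             # larger right part: keep looping
            else:
                sort_range(p + 1, hi)  # smaller right part: recurse
                hi = p - 1             # larger left part: keep looping
    sort_range(0, len(a) - 1)
    return a
```

Because every recursive call handles at most half of the current range, the stack depth stays at O(lg n) even when the partitions are badly unbalanced. The running time can still degrade to O(n^2) in that case, since the larger part is processed by the loop. With a first-element pivot, as in ECL and ABCL, already sorted input is one such unbalanced case, and median-of-3 avoids it.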
As for CLISP, I think it uses a tree sort but I am not entirely sure. The only source file I could find with a sort implementation was sort.d and it looks like it contains an implementation of tree sort with a self-balancing binary tree, which also gives this algorithm an average and worst-case performance of O(n lg n).
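For readers unfamiliar with tree sort, the idea can be sketched as below. This is a simplified Python illustration and not CLISP's sort.d. In particular it uses a plain, unbalanced binary search tree, so its worst case is O(n^2) rather than the O(n lg n) a self-balancing tree would give. The names Node and tree_sort are hypothetical.

```python
class Node:
    """A binary search tree node holding one value."""

    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def tree_sort(items):
    """Insert every item into a BST, then read the values back in order."""
    root = None
    for value in items:
        node = Node(value)
        if root is None:
            root = node
            continue
        current = root
        while True:
            if value < current.value:
                if current.left is None:
                    current.left = node
                    break
                current = current.left
            else:
                # Equal values go right, so they come out after earlier duplicates.
                if current.right is None:
                    current.right = node
                    break
                current = current.right

    # Iterative in-order traversal, so deep trees cannot overflow the call stack.
    result, stack, current = [], [], root
    while stack or current is not None:
        while current is not None:
            stack.append(current)
            current = current.left
        current = stack.pop()
        result.append(current.value)
        current = current.right
    return result
```

Sending equal values to the right subtree means duplicates are read back in insertion order, which is how a tree sort can serve as a stable sort.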
As expected, most of the implementations use merge sort to implement stable-sort since it is a stable sort with average and worst-case performance of O(n lg n). Apparently, all the merge sort implementations are bottom-up, with the exception of CCL and ECL. Another interesting thing is that merge sort is also used for lists in sort, in most of the implementations. However, I found it surprising to find quicksort in the stable-sort column because it is not a stable algorithm. Since it is only used for strings and bit vectors, it is not really an issue. While reading the source code of the implementations, I realized that ABCL was using quicksort in stable-sort for all non-list sequences. This is a problem that exists in the current 1.0.1 release but I've sent a bug report with a quick fix to the maintainers. The next release should have stable-sort fixed.
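A bottom-up merge sort, the variant most of these implementations appear to use, can be sketched like this. It is a generic Python illustration under my own naming (merge, bottom_up_merge_sort), not code taken from any of the surveyed implementations.

```python
def merge(left, right, key):
    """Merge two sorted lists; on ties take from `left` first to stay stable."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if key(right[j]) < key(left[i]):  # strictly less: ties go to the left run
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def bottom_up_merge_sort(items, key=lambda x: x):
    """Merge runs of width 1, 2, 4, ... iteratively, with no recursion."""
    items = list(items)
    n = len(items)
    width = 1
    while width < n:
        merged = []
        for start in range(0, n, 2 * width):
            left = items[start:start + width]
            right = items[start + width:start + 2 * width]
            merged.extend(merge(left, right, key))
        items = merged
        width *= 2
    return items
```

Stability comes entirely from the tie-breaking rule in merge: an element from the right run is taken only when it is strictly smaller, so equal elements never overtake each other.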
This exploration of the sorting algorithms used in the open source implementations was very educational and interesting to me. I've learned what algorithms are actually used and enjoyed seeing how they were implemented. Just spotting the issue in ABCL stable-sort made this review worthwhile. I think there is still room for improvement in some implementations but knowing now the strengths and weaknesses of the sorts in CL is already good enough. On a final note, I just wonder which algorithms are used in ACL and LW.