KEEP CALM and SORT it out

12 minute read

When I first started learning algorithms, I always had this question: why sorting algorithms are so important that every professor couldn’t stop talking about it? Apart from rearranging all elements using a given operators, what else does it do? And is it occasionally important to choose the right sorting algorithms?

And the answer is YES.

Sorting algorithms are critical to today’s internet is because it helps with the efficiency and searching performance, especially when the amount of data is rapidly growing. Just imagine you have 10 million of unique, randomly distributed integers and you gotta find the number equal to the given value, what should you do? (THOUGHTS)

As I quote from this coder:

It’s the same reason people seem too interested in factorial when you start talking about recursion: it’s a thing you already know how to do, but now you’re going to look at it a new way. They’re showing how different types of algorithms can make a big difference in performance. They’re showing how to calculate the growth rate of time required as the number of items grows large.

so let’s keep calm and SORT this out.

Quick Sort

  1. choose a pivot
  2. rearrange arrays, elements smaller than pivot goes to left and the rest goes to right.
  3. recursive untill gets the answer.

Quick sort is one of the most important and popular sorting algorithms. It’s also used in the sort() method of many languages. java.util.Arrays uses quick sort(dual pivot quick sort) in the most recent versions for primitive sort like int and merge sort for objects that implement Comparable or use a Comparator(because quicksort is not stable. e.g.: equal entries can change their relative position during sort so if we sort a sorted array it may not stay unchanged. Because primitive types have no ientity so there’s no way to distinguish two ints with same value but for reference type this could casue problems. And why primitives not use merge sort is because it requires a lone of the array.).

Quick Sort cannot work with large datasets.

  • Average: O(nlogn)
  • Space: O(logn)
  • Best: O(O(nlogn))
  • Worst: O(n^2): This usually depends on which pivot you choose but most of time the leftmost/rightmost pivot is chosen. Then in following cases we can get worst performance:
  1. Array is already sorted.
  2. All elements are the same(the edge case of 1). So to solve this, either choose randomly or choose the median of three(first, median, last) as the pivot.

a. recursion

In-place, no need to return.

def quick_sort(array, l, r):
    if l < r:
        q = partition(array, l, r)
        quick_sort(array, l, q - 1)
        quick_sort(array, q + 1, r)

def partition(array, l, r):
    x = array[r]
    i = l - 1
    for j in range(l, r):
        if array[j] <= x:
            i += 1
            array[i], array[j] = array[j], array[i]
    array[i + 1], array[r] = array[r], array[i+1]
    return i + 1

b. iteration using stack

Usually stack has two meanings: a LIFO data structure and stack-based memory allocation(whose structure is still stack). In assembly language, when we call a function, we utilmately change the stack pointer ESPextended stack pointer(issues instructions like POP, PUSH, CALL…) that stored on the top of stack. And another register EBPextended base pointer is stored at the bottom and can be moved by prorammers to a particular place in the stack at any time so the data on stack can be read from and written to.

For quick sort, the ending condition for iteration is low == high.

def quick_sort(array, l, r):
    if l >= r:
        return
    stack = []
    stack.append(l)
    stack.append(r)
    while stack:
        low = stack.pop(0)
        high = stack.pop(0)
        if high - low <= 0:
            continue
        x = array[high]
        i = low - 1
        for j in range(low, high):
            if array[j] <= x:
                i += 1
                array[i], array[j] = array[j], array[i]
        array[i + 1], array[high] = array[high], array[i + 1]
        stack.extend([low, i, i + 2, high])

Merge Sort

Like QUick Sort, Merge Sort is also a Divide and Conquer algorithm.

  1. find the middle point to divide: m = len(nums) // 2
  2. call mergeSort for first half: call mergeSort(nums, l, m)
  3. call mergeSort for second half: call mergeSort(nums, m, r)
  • Time: O(nlogn)
  • Space: O(nlogn)
  • Best: O(nlogn)
  • Worst: O(nlogn)

Merge Sort uses additional storage for sorting the auxilary array. It uses three arrays where two are used for storing each half, and the third external one is used to store the final sorted list by merging other two and each array is then sorted recursively.

Merge Sort is a recursive algorithm and the time complexity can be expressed as recurrence relation: T(n) = 2T(n/2) + O(n)

Merge Sort is stable. It can sort linked lists in O(nlogn). Unlike araey, linked list have O(1) insertion and deletion, so we can use merge sort to merge without using extra space.

def merge(left, right):
    res = []
    while left and right:
        if left[0] < right[0]:
            res.append(left.pop(0))
        else:
            res.append(right.pop(0))
    res = res + left + right
    return res

def mergesort(lists):
    if len(lists) <= 1:
        return lists
    mid = len(lists)//2
    left = mergesort(lists[:mid])
    right = mergesort(lists[mid:])
    return merge(left,right)

Tim Sort

Fun fact: The sorted()/sort() in Python is implemented using Tim Sort. As well as for Array.list in Java

  1. divide arrays into blocks known as Run.(The size of run vary from )
  2. Sort runs using insertion sort one by one.
  3. Then merge those using merge in merge sort.
  • Time: O(nlogn)
  • Space: O(n)
  • Best: O(n)
  • Worst: O(nlogn)

Heap Sort

  1. Build a max heap from the input data.
  2. At this point, the largest item is stored at the root of the heap. Replace it with the last item of the heap followed by reducing the size of heap by 1. Finally, heapify the root of tree.
  3. Repeat above steps while size of heap is greater than 1.
  • Average: O(nlogn)
  • Space: O(1)
  • Best: O(nlogn)
  • Worst: O(nlogn)
def fixDown(a,k,n): #自顶向下堆化,从k开始堆化
	N=n-1
	while 2*k<=N:
		j=2*k
		if j<N and a[j]<a[j+1]: #选出左右孩子节点中更大的那个
			j+=1
		if a[k]<a[j]:
			a[k],a[j]=a[j],a[k]
			k=j
		else:
			break
 
def heapSort(l):
	n=len(l)-1
	for i in range(n//2,0,-1):
		fixDown(l,i,len(l))
	while n>1:
		l[1],l[n]=l[n],l[1]
		fixDown(l,1,n)
		n-=1
	return l[1:]
 
l=[-1,26,5,77,1,61,11,59,15,48,19] #第一个元素不用,占位
res=heapSort(l)
print(res)

Bubble Sort

Bubble Sort is the simplest sorting algorithm that works by repeatedly swapping the adjacent elements if they are in wrong order.

  • Average: O(n^2)
  • Space: O(1)
  • Best:O(n), when the array is already sorted.
  • Worst: O(n^2)

Example:
First pass:
( 5 1 4 2 8 ) –> ( 1 5 4 2 8 )
( 1 5 4 2 8 ) –> ( 1 4 5 2 8 )
( 1 4 5 2 8 ) –> ( 1 4 2 5 8 )
( 1 4 2 5 8 ) –> ( 1 4 2 5 8 )
Second pass:
( 1 4 2 5 8 ) –> ( 1 4 2 5 8 )
( 1 4 2 5 8 ) –> ( 1 2 4 5 8 )
( 1 2 4 5 8 ) –> ( 1 2 4 5 8 )
( 1 2 4 5 8 ) –> ( 1 2 4 5 8 )
Third pass(tho already sorted, need one more pass to verify):
……

def bubbleSort(arr):
    n = len(arr)

    for i in range(n):
        swapped = False
        for j in range(0, n-i-1):  # Last is already in place
            # traverse the array from 0 to n-i-1.
            # Swap if greater than the next
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
                swapped = True
        # IF no two elements were swapped
        # by inner loop, then break
        if swapped == False:
            break

Insertion Sort

  • Average: O(n^2)
  • Space: O(1)
  • Best:O(n), when the array is already sorted.
  • Worst: O(n^2)
def insertionSort(arr):
    for i in range(1, len(arr)):
  
        key = arr[i]
        # Move elements of arr[0..i-1], that are
        # greater than key, to one position ahead
        # of their current position
        j = i-1
        while j >= 0 and key < arr[j]:
                arr[j + 1] = arr[j]
                j -= 1
        arr[j + 1] = key

Selection Sort

The selection sort algorithm sorts an array by repeatedly finding the minimum element (considering ascending order) from unsorted part and putting it at the beginning. The algorithm maintains two subarrays in a given array.

  1. The subarray which is already sorted.
  2. Remaining subarray which is unsorted.
  • Average: O(n^2)
  • Space: O(1)
  • Best: O(n^2), when the array is already sorted.
  • Worst: O(n^2)

Example
11 25 12 22 64
11 12 25 22 64
11 12 22 25 64
11 12 22 25 64

for i in range(len(A)):
    min_idx = i
    for j in range(i+1, len(A)):
        if A[min_idx] > A[j]:
            min_idx = j
    A[i], A[min_idx] = A[min_idx], A[i] # swap with first element

Tree Sort

  1. Create a BST by inserting data in the binary search tree.
  2. performance in-order traversal on the tree.
  • Average: O(nlogn)
  • Space: O(n)
  • Best: O(nlogn)
  • Worst: O(n^2)
class SortTree:
  def __init__(self, value):
    self.left = None
    self.value = value
    self.right = None
  def insert_val(self, _value):
    if _value < self.value:
       if self.left is None:
         self.left = SortTree(_value)
       else:
         self.left.insert_val(_value)
    else:
       if self.right is None:
         self.right = SortTree(_value)
       else:
         self.right.insert_val(_value)

def display(_node):
   return list(filter(None, [i for b in [display(_node.left) if isinstance(_node.left, SortTree) else [getattr(_node.left, 'value', None)], [_node.value], display(_node.right) if isinstance(_node.right, SortTree) else [getattr(_node.right, 'value', None)]] for i in b]))

tree = SortTree(4)
for i in [5, 3, 1, 2, 8, 7, 4]:
  tree.insert_val(i)

print(display(tree))

Shell Sort

Shell sort is a variation of insertion sort. In insertion sort, we move elements only one position ahead. When an element has to be moved far ahead, many movements are involved. The idea of shellSort is to allow exchange of far items. In shellSort, we make the array h-sorted for a large value of h. We keep reducing the value of h until it becomes 1. An array is said to be h-sorted if all sublists of every h’th element is sorted.

  • Average: O(n(logn)^2)
  • Space: O(1)
  • Best: O(nlogn)
  • Worst: O(n(logn)^2)
def shellSort(arr):
    # Start with a big gap, then reduce the gap
    n = len(arr) 
    gap = n//2
  
    # Do a gapped insertion sort for this gap size. 
    # The first gap elements a[0..gap-1] are already in gapped  
    # order keep adding one more element until the entire array 
    # is gap sorted 
    while gap > 0: 
        for i in range(gap,n):
            # add a[i] to the elements that have been gap sorted 
            # save a[i] in temp and make a hole at position i 
            temp = arr[i] 
  
            # shift earlier gap-sorted elements up until the correct 
            # location for a[i] is found 
            j = i
            while  j >= gap and arr[j-gap] >temp:
                arr[j] = arr[j-gap]
                j -= gap
            # put temp (the original a[i]) in its correct location
            arr[j] = temp
        gap //= 2

Couting Sort

Counting sort is a sorting technique based on keys between a specific range. It works by counting the number of objects having distinct key values (kind of hashing). Then doing some arithmetic to calculate the position of each object in the output sequence.

  • Average: O(N+k)
  • Space: O(k)
  • Best: O(n+k)
  • Worst: O(n+k)

Example:
Input data: 1, 4, 1, 2, 7, 5, 2

1. Take a count array to store the count of each unique object.
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 2 2 0 1 1 0 1 0 0

2. Modify the count array such that each element at each index stores the sum of previous counts.
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 2 4 4 5 6 6 7 7 7

3. Output each object from the input sequence followed by decreasing its count by 1.

def countSort(arr):
    output = [0 for i in range(256)]
  
    count = [0 for i in range(256)]
  
    ans = ["" for _ in arr] 
  
    for i in arr: 
        count[ord(i)] += 1
  
    for i in range(256):
        count[i] += count[i-1]
  
    for i in range(len(arr)): 
        output[count[ord(arr[i])]-1] = arr[i] 
        count[ord(arr[i])] -= 1
  
    for i in range(len(arr)):
        ans[i] = output[i]
    return ans  

Radix Sort

Why we need Radix Sort? Well, the fastest the Comparison-based sorting algorithms can get is O(nlogn)(Quicksort, mergesort, heapsort…) but they cannot get better than that. Also for counting sort it’s O(n+k) so if the elements are in range from 1 to n^2 the performance would be O(n^2).(UGH)

That’s when radix sort comes in. It uses counting sort as a subroutine and starts sorting digit by digit from the least significant digit.

  1. sort input array use counting sort/any stable sort according to ith digit.
  2. repeat for each digir i from the least significant to the most significant digit.

For example:
unordered: 170, 45, 75, 90, 802, 24, 2, 66
1s place: 170, 90, 802, 2, 24, 45, 75, 66
10s place: 802, 2, 24, 45, 66, 170, 75, 90
100s place: 2, 24, 45, 66, 75, 90, 170, 802

Average: O(nk)

Space: O(n+k)

  • Best/Worst: O(nk)
from random import randint

def RadixSort(list,d):  
  for k in range(d):# d rounds of sort
    s=[[] for i in range(10)]#因为每一位数字都是0~9,故建立10个桶
        for i in list:  # start from least significant digit
            '''对于3个元素的数组[977, 87, 960],第一轮排序首先按照个位数字相同的
               放在一个桶s[7]=[977],s[7]=[977,87],s[0]=[960]
               执行后list=[960,977,87].第二轮按照十位数,s[6]=[960],s[7]=[977]
               s[8]=[87],执行后list=[960,977,87].第三轮按照百位,s[9]=[960]
               s[9]=[960,977],s[0]=87,执行后list=[87,960,977],结束。'''
            s[i/(10**k)%10].append(i) #977/10=97(小数舍去),87/100=0
        list=[j for i in s for j in i]
    return list

Bucket Sort

Bucket sort is used when input is uniformly distributed over a range.

  • Average: O(n+k)
  • Space: O(n)
  • Best: O(n+k)
  • Worst: O(n^2)
  1. create n empty buckets.
  2. insert array element into bucket[n* arr[i]].
  3. sort individual bucket using insertion sort.
  4. concatencate all sorted buckets.
def insertionSort(b):
    for i in range(1, len(b)):
        up = b[i]
        j = i - 1
        while j >=0 and b[j] > up:  
            b[j + 1] = b[j]
            j -= 1
        b[j + 1] = up
    return b

def bucketSort(x):
    arr = []
    slot_num = 10

    for i in range(slot_num):
        arr.append([])

    for j in x:   # Put array elements in different buckets  
        index_b = int(slot_num * j)  
        arr[index_b].append(j)

    for i in range(slot_num): # sort individual
        arr[i] = insertionSort(arr[i])

    k = 0
    for i in range(slot_num):
        for j in range(len(arr[i])):
            x[k] = arr[i][j]
            k += 1
    return x

Comb Sort

An unstale improvement over bubble sort. Bubble sort compares adjacent values so all inversions are removed one by one.

  • Time: O(n^2 / 2^p) where p is the number of increments.
  • Space: O(1)
  • Best: O(nlogn)
  • Worst: O(n^2)
def getNextGap(gap): 
    # Shrink gap by Shrink factor 
    gap = (gap * 10)/13
    if gap < 1:
        return 1
    return gap

def combSort(arr):
    n = len(arr)
    gap = n
  
    # Initialize swapped as true to make sure that 
    # loop runs 
    swapped = True
  
    # Keep running while gap is more than 1 and last 
    # iteration caused a swap 
    while gap !=1 or swapped == 1: 
        gap = getNextGap(gap) 
        # Initialize swapped as false so that we can 
        # check if swap happened or not 
        swapped = False
        # Compare all elements with current gap 
        for i in range(0, n-gap): 
            if arr[i] > arr[i + gap]: 
                arr[i], arr[i + gap]=arr[i + gap], arr[i] 
                swapped = True

Which algorithm for large dataset?

ref: stackoverflow or here

Depends on several factors:

  1. Can fit data in main memory? If not, then reply on external sorting algorithm – quick sort and merge sort.(must reside in a much slower external memory like hardrive, external sorting usually uses a hybrid sort-merge startegy, sorting phase writes data into chunks)

  2. Input distibution is mostly sorted? If not, Tim Sort.

  3. Element kind? if generic objects, only comparision sorting. If not, can use some non-comparision type like radix or counting sort.

  4. Can parallelize? Quick Sort, merge sort, BSD radix sort parallelize very well, while orthers like heapsort do not.

Leave a comment