onsdag 18. juni 2014

[Data structures - Part 1] Vector

Introduction


In this series I will be talking about the major data structures, what operations they support and how to implement it. It is intended mainly as a way for me to learn these things better, but I also hope it will be of use to someone.

At the same time I will try to describe the functionality of the various classes in the STL.

Let's start with one of the most basic one, vector.

ADTs


Vectors are ADTs ( Abstract Data Type ) This means it's not a concrete data type like ints, pointers, arrays, etc... An ADT is just a data structure with no specific implementation. It could be implemented as an array, but it could also be implemented as a linked list ( I'll write a post on this later. In simplicity, it's just a container where each element has a pointer to the next one. )

An ADT has a specific set of operations that can be performed on it. On a vector, you can add an item to the back( push ), remove the element in the back ( pop ) or read a random element. In most implementations, vectors are implemented as an array because it's way simpler and more efficient. But you could implement it as a linked list if you wanted to.

The Vector ADT


Vectors are one of the most important structures. According to Bjarne Stroustrup
If you know vector and int, you know C++

While this is not entirely true, vectors are very useful, and very fast.
Vector is an array-like container. It's very useful for storing data, and it's particularly focused on adding/removing elements from the back.

Vectors are focused around adding and removing elements to the end of the vector. But it also supports random access, so it will work as a regular array in many cases.

Vector is a good choice if you need to store a list of object and you intend to add items as you go along. As long as you add/delete at the end, vectors are very fast. Adding and removing at the end is O( 1 )

Supported operations


The following functions are typical to a vector:
  • Push Back
    • Insert at the end
  • Pop
    • Remove last element
  • Back
    • Access last element
  • operator[ ]
    • Access a specified element


Implementation


The random access functionality of a vector is why it's usually implemented using an array. And when using an array, the items are laid out consecutively in memory. This means that, if element 0 is at pos x, then element 1 is at position x + size, where size is the size of one element. And the CPU works VERY efficiently when the memory is laid out like this. And that is one of the major reasons it's so widely used.

And this is one of vectors shortcomings. Because when extending a vector, you usually need to move all elements to a different place in memory. And since your vector can potentially hold millions of elements, this might take some time.

Luckily though, this is one of the areas the CPU excels. It's incredible effective when it comes to working on consecutive blocks of memory. So in a lot of cases, the overhead of increasing the capacity of a vector is not so bad.

Data members


The following are the data members a standard vector needs to have. Note: these are intended to show the basic member of a vector. The members below are not the same as the ones in std::vector, they're just intended used as an example.

Type[] data


The array that holds all the data. It contains all elements in the vector. The memory for the object is dynamically allocated with new, since a vector needs to be able to grow.

unsigned int size


The number of elements in the vector. Note: this is not the same as capacity of the vector. The size of a vector is just how many elements are in the vector right now. It is equivalent to the length of a string.

unsigned int capacity


How many items the vector can currently store. This number will always be as large, or larger, than the size of a vector.

Size vs capacity


Every time you insert an item, the vector needs to check whether it is equal to the capacity of the vector. If it is, the vector is full and the capacity needs to be increased. This also means new space have to be allocated and the elements has to be moved. See reserve further down for details.

Inserting


When it comes to insertions, there are a few cases to consider

Inserting at the end ( index = size of vector )


Say you have a vector of 11 elements. The first element is at index 0, the las one at index 10. Inserting an element to the back ( index 11 ) means you have to do the following:
  1. Insert the item at the next empty position ( 11 )
  2. Increase the size by 1
Done!

Inserting in a differend position


Now let's assume you have the same vector as before ( 11 elements ) and you want to add an element at index 5 :
  1. Take all elements from index 5 to 10 ( the last element ) and move them back one position ( increasing index by 1 ). The last item will now be at index 11 and the position 5 will be empty
  2. Now you can insert the item into position 5
  3. Increase the size  by one
The function might have to move a lot of elements. But this isn't a huge worry since CPU's work very efficiently on consecutive blocks of memory like arrays and vectors.

Functions in std::vector


These are all the supported functions of std::vector you find in the C++ STL ( Standard Template Library. ) This section does not contain the algorithms like std::find(), I will cover these later.

Push Back


The basic insertion algorithm, it adds an element to the end of the vector. It will also increment the size by one. If the current capacity of the vector is too small to fit this element, the capacity will be extended. See the point Reserve( int ) for more info.

Reserve


This function increases the capacity of the vector. The function will do the following :
  1. Allocate a new, larger, chunk of memory.
  2. Copy all elements to this chunk of memory
  3. Deallocate the old chunk of memory

If you try to reserve more than the theoretical limit of the vector ( see Max Size ), and exception of the type std::length_error will be thrown.

A few notes :
  • The function might have to move a lot of elements.
    • In most cases this isn't a huge concert since CPU's are very efficient at this.
  • If you are constantly inserting a lot of elements, this function might get called automatically a lot. 
    • To prevent this, you can call Reserve() with a large value so that the size never grows beyond the capacity so that the capacity won't have to be extended.
  • You can't use this function to shrink the capacity.
    • If you call this function with a value smaller than the current capacity nothing will happen.
    • You can use ShrinkToFit() to decrease the capacity ( see below. )

Shrink To Fit


Will reduce memory usage by freeing unused memory. Since it deallocates all unused memory ( all elements after size - 1 ), capacity will be set to the same as size. If you try to PushBack or Insert an item after calling this method, Reserve() will be called.

Back


Simply returns the last element.

Front


Simply returns the first element.

operator [int]


Returns the element specified the brackets[]. Does NOT perform any bounds checking. If you specify a number below 0, or equal to or more than size. The function returns a reference, so this function can be used to change the element in the vector.

At( int )


Just like the [], this returns the elements specified in the function argument. But unlike the [] operator this function performs bounds checking. If you specify an index out of bonds, an exception of the type std::out_of_range is thrown.Just like [], function returns a reference, so this function can be used to change the element in the vector.

So why use the brackets[]?


While at() is technically safer, there is also a little overhead in performing the bounds checking. [] can potentially be run in just in one clock cycle.

My general rule of thumb is to use [] in your code. If you check and test your code properly, stepping out of bounds should not be a risk.

Empty


Returns true if the vector is empty, otherwise false.

Size


Returns the number of elements currently in the vector.

Capacity


The number of items the vector has allocated space for. If this is equal to the sixe, the vector needs to be expanded ( see Reserve() above. )

Data


Returns the underlying data of the vector.

Max Size


Returns the the maximum theoretical capacity of the vector, meaning the maximum number of elements the vector can hold using all avaialbe memory.Note : this is not the capacity of the vector.

When size reaches capacity


So when the size reaches the same value as capacity, the capacity of the vector needs to be extended. Usually this is checked whenever an element is inserted and the function Reserve( int ) will be called.


Illustration


Here is a simple illustration of a vector.













It demonstrates insert ( both at the back and any other position, ) and remove. It also shows back element ( last element in the vector, ) size ( how many items that are in the vector ) and capacity ( how many items is there room for. )

This code is meant as a base for later parts where I will show various sorting algorithms. You can find the code here.


=========================

A container we'll look at later, linked list solves the problem of capacity limitations.



Feel free to comment if you have anything to say or ask questions if anything is unclear. I always appreciate getting comments.

For a full list of my tutorials / posts, click here.

Ingen kommentarer:

Legg inn en kommentar