What is data oriented design? [closed]


I was reading this article, and this guy goes on talking about how everyone can greatly benefit from mixing in data oriented design with OOP. He doesn't show any code samples, however. I googled this and couldn't find any real information as to what this is, let alone any code samples. Is anyone familiar with this term and can provide an example? Is this maybe a different word for something else?

asked Oct 29, 2009 at 4:13

That article in Game Developer is now available in easy-to-read blog form: gamesfromwithin.com/data-oriented-design

Commented Jun 3, 2011 at 4:38

Here's an aggregate of DOD content on the web

Commented Feb 4, 2014 at 7:44

Also many other related links: github.com/dbartolini/data-oriented-design

Commented Aug 13, 2022 at 4:37

6 Answers

First of all, don't confuse this with data-driven design.

My understanding of Data-Oriented Design (DOD) is that it is about organizing your data for efficient processing, especially with respect to cache misses. Data-Driven Design, on the other hand, is about letting data control a lot of the behavior of your program (described very well by Andrew Keith's answer).

Say you have ball objects in your application with properties such as color, radius, bounciness, position, etc.

Object Oriented Approach

In OOP you would describe balls like this:

    class Ball {
        Point position;
        Color color;
        double radius;

        void draw();
    };

And then you would create a collection of balls like this:

    vector<Ball> balls;

Data-Oriented Approach

In Data Oriented Design, however, you are more likely to write the code like this:

    class Balls {
        vector<Point> position;
        vector<Color> color;
        vector<double> radius;

        void draw();
    };

As you can see there is no single unit representing one Ball anymore. Ball objects only exist implicitly.

This can have many advantages, performance-wise. First, we usually want to do operations on many balls at the same time, and the hardware usually wants large contiguous chunks of memory to operate efficiently.

Secondly, you might do operations that affect only some of the properties of a ball. For example, if you combine the colors of all the balls in various ways, then you want your cache to contain only color information (which DOD allows). However, when all ball properties are stored in one unit (as in OOP), you will pull in all the other properties of a ball as well, even though you don't need them.
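To make the color case concrete, here is a minimal sketch of both layouts with one color-only operation. The `Point` and `Color` definitions, the `add` helper, and `average_red` are assumptions made up for illustration; the answer itself does not define them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal types; the original answer leaves Point and Color undefined.
struct Point { float x, y; };
struct Color { float r, g, b; };

// Array-of-structures layout: every ball carries all of its properties.
struct Ball {
    Point position;
    Color color;
    double radius;
};

// Structure-of-arrays layout: one contiguous array per property.
// Element i of each array belongs to the same (implicit) ball.
struct Balls {
    std::vector<Point> position;
    std::vector<Color> color;
    std::vector<double> radius;

    std::size_t size() const { return color.size(); }

    void add(Point p, Color c, double r) {
        position.push_back(p);
        color.push_back(c);
        radius.push_back(r);
    }
};

// Averages the red channel by walking whole Ball objects.
// Each iteration also drags position and radius through the cache.
float average_red(const std::vector<Ball>& balls) {
    assert(!balls.empty());
    float sum = 0.0f;
    for (const Ball& b : balls) sum += b.color.r;
    return sum / static_cast<float>(balls.size());
}

// Same computation on the structure-of-arrays layout: only the color array is read.
float average_red(const Balls& balls) {
    assert(balls.size() > 0);
    float sum = 0.0f;
    for (const Color& c : balls.color) sum += c.r;
    return sum / static_cast<float>(balls.size());
}
```

Both overloads return the same value; the difference is only in how much unrelated memory the loop has to stream past to get it.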

Cache Usage Example

Say each ball takes up 64 bytes and a Point takes 4 bytes. A cache slot takes, say, 64 bytes as well. If I want to update the position of 10 balls, then:

In OOP, I have to pull 10 x 64 = 640 bytes of memory into cache and get 10 cache misses.
In DOD, however, I can pull in the positions of the balls separately (without pulling in each ball's other properties), which takes only 10 x 4 = 40 bytes. That fits in one cache fetch, so we get only 1 cache miss to update all 10 balls.

These numbers are arbitrary - I assume a cache block is bigger.

But it illustrates how memory layout can have a severe effect on cache hits and thus performance. This will only increase in importance as the difference between CPU and RAM speed widens.
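The arithmetic in the cache usage example can be reproduced with a small helper. This is only a sketch: `cache_lines_touched` is a hypothetical function, and it assumes the data starts on a cache-line boundary and that nothing is prefetched or already cached.

```cpp
#include <cassert>
#include <cstddef>

// Counts the distinct cache lines touched when reading `count` fields of
// `element_bytes` each, where consecutive fields are `stride_bytes` apart.
// Assumption: the first field starts exactly on a cache-line boundary.
std::size_t cache_lines_touched(std::size_t count,
                                std::size_t element_bytes,
                                std::size_t stride_bytes,
                                std::size_t line_bytes) {
    assert(element_bytes > 0 && line_bytes > 0 && stride_bytes >= element_bytes);
    std::size_t lines = 0;
    std::size_t last_line = 0;
    bool seen_any = false;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t first = (i * stride_bytes) / line_bytes;
        const std::size_t last = (i * stride_bytes + element_bytes - 1) / line_bytes;
        // Offsets only grow, so a line is new exactly when it lies past the last one seen.
        for (std::size_t line = first; line <= last; ++line) {
            if (!seen_any || line > last_line) {
                ++lines;
                last_line = line;
                seen_any = true;
            }
        }
    }
    return lines;
}
```

With the numbers from the example, `cache_lines_touched(10, 4, 64, 64)` gives 10 for the object layout (one 4-byte position inside each 64-byte ball), while `cache_lines_touched(10, 4, 4, 64)` gives 1 for the packed position array.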

How to lay out the memory

In my ball example, I simplified the issue a lot, because usually for any normal app you will likely access multiple variables together. E.g. position and radius will probably be used together frequently. Then your DOD structure should be:

    class Body {
        Point position;
        double radius;
    };

    class Balls {
        vector<Body> bodies;
        vector<Color> color;

        void draw();
    };

The reason you should do this is that if data used together are placed in separate arrays, there is a risk that they will compete for the same slots in the cache. Thus loading one will throw out the other.
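As a rough sketch of why grouping helps, consider an operation that needs position and radius together. The `Point` and `Color` definitions and the `count_touching_floor` function are hypothetical, invented here only to show the access pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal types for illustration.
struct Point { double x, y; };
struct Color { float r, g, b; };

// Position and radius are used together, so they live in one record.
struct Body {
    Point position;
    double radius;
};

// Color is used separately, so it keeps its own array.
struct Balls {
    std::vector<Body> bodies;
    std::vector<Color> color;
};

// Counts balls whose lowest point is at or below floor_y.
// The loop reads one contiguous array and never touches color.
std::size_t count_touching_floor(const Balls& balls, double floor_y) {
    assert(balls.bodies.size() == balls.color.size());
    std::size_t touching = 0;
    for (const Body& b : balls.bodies) {
        if (b.position.y - b.radius <= floor_y) ++touching;
    }
    return touching;
}
```

Each iteration finds both values it needs next to each other, so there is a single stream of memory to fetch instead of two streams that might evict each other.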

So compared to Object-Oriented programming, the classes you end up making are not related to the entities in your mental model of the problem. Since data is lumped together based on data usage, you won't always have sensible names to give your classes in Data-Oriented Design.

Relation to relational databases

The thinking behind Data-Oriented Design is very similar to how you think about relational databases. Optimizing a relational database can also involve using the cache more efficiently, although in this case, the cache is not CPU cache but pages in memory.

A good database designer will also likely split out infrequently accessed data into a separate table rather than creating a table with a huge number of columns where only a few of the columns are ever used.
Or, he might also choose to denormalize some of the tables (maybe into a single table), so that data don't have to be accessed from multiple locations on disk.

Just like with Data-Oriented Design these choices are made by looking at what the data access patterns are and where the performance bottleneck is.

answered Jan 7, 2010 at 16:29 Erik Engheim

well said; I've got only one question though. Let's say we have a structure struct balls , wouldn't updating the position of each ball actually thrash the cache since you'd move back and forth between the velocity vector and the position vector (yes modern machines and cache-lines and all that, this is also just an illustration)?

Commented Aug 3, 2010 at 12:12

It might. But remember that the whole pos array will not be pulled in at once, just one cache line and possibly some prefetching. Likewise with velocity. So for them to thrash each other, each corresponding chunk of pos and velocity has to map to the same cache line. That can of course happen, which is why the recommendation is to put variables that are used together in one struct. So, e.g., velocity and pos would be in one vector while color would be in another vector.

Commented Aug 17, 2010 at 8:30

It looks like switching from an array of objects to an object of arrays.

Commented Nov 13, 2016 at 14:36

Very well explained. Recently (since 2018), the Unity game engine has been modified to use Data-Oriented Design to exploit CPU caches for better performance in games. And guess who helped Unity achieve it? Mike Acton and Andreas Fredriksson :) Here is the link for more info.

Commented Jun 23, 2020 at 7:52

@JMas True, but this mapping isn't arbitrary. CPU caches still work. Caches have to work for what looks contiguous to the process; otherwise we would have killed performance and there would have been no point to CPU caches. I don't know the exact details of how this works. But when a process accesses a byte at virtual memory address X, you can expect the (typically 64-byte) cache line containing X to get pulled into cache from memory. Yes, sometimes swapping needs to happen. But on repeated access this is generally what happens.

Commented Dec 10, 2021 at 10:15

Mike Acton gave a public talk about Data oriented design recently:

My basic summary of it would be: if you want performance, then think about data flow, find the storage layer that is most likely to screw with you and optimize for it hard. Mike is focusing on L2 cache misses, because he's doing realtime, but I imagine the same thing applies to databases (disk reads) and even the Web (HTTP requests). It's a useful way of doing systems programming, I think.

Note that it doesn't absolve you from thinking about algorithms and time complexity; it just focuses your attention on figuring out the most expensive operation type that you then must target with your mad CS skills.

answered Jun 23, 2015 at 10:52 Aleksei Averchenko

Okay probably not HTTP requests :)

Commented Nov 11, 2022 at 12:27

I just want to point out that Noel is talking specifically about some of the specific needs we face in game development. I suppose other sectors that are doing real-time soft simulation would benefit from this, but it is unlikely to be a technique that will show noticeable improvement to general business applications. This set up is for ensuring that every last bit of performance is squeezed out of the underlying hardware.

answered Feb 23, 2010 at 2:03

Agreed. Some other areas where data-oriented design is significant are: hardware and firmware for high-bandwidth devices (e.g. networking or storage); large-scale scientific computing (e.g. weather simulation, protein folding); signal processing (e.g. audio, image, video); and data compression. These fall under "Computational Science and Engineering", which is sometimes offered as a separate major from the more typical Computer Science.

Commented Jul 10, 2013 at 8:45

A data oriented design is a design in which the logic of the application is built up of data sets, instead of procedural algorithms. For example

    int animation; // this value is the animation index

    if (animation == 0)
        PerformMoveForward();
    else if (animation == 1)
        PerformMoveBack();
    // etc

data design approach

    typedef struct
    {
        int Index;
        void (*Perform)();
    } AnimationIndice;

    // build my animation dictionary
    AnimationIndice AnimationIndices[] =
    {
        { 0, PerformMoveForward },
        { 1, PerformMoveBack }
    };

    // when it's time to run, I use my dictionary to find my logic
    int animation; // this value is the animation index
    AnimationIndices[animation].Perform();

Data designs like this promote the usage of data to build the logic of the application. It's easier to manage, especially in video games, which might have thousands of logic paths based on animation or some other factor.

answered Oct 29, 2009 at 4:28 Andrew Keith

This is actually not correct. You are confusing data oriented design with data driven design. I did the same thing until I read Noel's article and realized he was talking about something entirely different.

Commented Jan 7, 2010 at 15:00

Also, Indice is not a word. There's "index" and "indices" and some even condone "indexes", but "indice" is never right.

Commented Mar 18, 2011 at 18:38

I use "dex" for index and avoid using plural in my code as the rules for making things plural are not regular. And code should be regular. If you have to break the rules of english to make the code more [ uniform / regular ] do it.

Commented Dec 27, 2020 at 5:08

This answer shed a precious light on this; thanks for being sometimes 'not correct'! +1 ;o)

Commented Apr 10, 2022 at 20:02

I first heard about Data-Oriented Design in the Our Machinery podcast, episode "S3: EP4 Data-oriented Design". https://www.owltail.com/podcast/Atvr2-Our-Machinery

Maybe this has changed, but a while ago finding information about Data-Oriented Design was difficult. The only book I found is: https://www.manning.com/books/data-oriented-programming

answered Aug 22, 2022 at 20:16

The second link is about a different paradigm. The only book I know of is the one I wrote: dataorienteddesign.com/dodbook, also available on Amazon: amazon.com/dp/1916478700

Commented Mar 13, 2023 at 12:40

If you want to take advantage of modern processor architecture, you need to lay out your data in memory in a certain way. CPUs are really good at processing simple types that are laid out sequentially in memory. Any other layout has a much higher processing cost.

In the object-oriented approach, you always think about one instance, and then you extend it to several instances by grouping objects into collections. But from the hardware point of view, this comes with an added cost.

In the data-oriented approach, you don't have an "instance" in the same way you have in object-oriented programming. Your instance can have an identifier, similar to data in relational databases, but apart from that, data related to your instance can be split over several tables (tables are implemented as vectors) to allow efficient processing.

An example: imagine you have class Student { int id; std::string name; float average; bool graduated; }. In the case of OOP, you would put all your students in a single vector.

In data-oriented design, you will first ask yourself what kind of processing you want to do on this data. Say you want to calculate an average mark for all students who still haven't graduated. So you will create one table that contains only students who have graduated and another that contains only those who haven't. You won't keep the student name in those tables since it is not used for processing. But you will keep a student ID and an average mark in each table.

Now calculating the average mark for non-graduated students means iterating through the non-graduated table and performing the calculation. Since average marks are neighboring in memory, your CPU can use SIMD and process the data very efficiently. Since we are not reading the bool graduated to test whether each student has graduated, we avoid a branch per student and avoid pulling unneeded data into the cache.
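A minimal sketch of that table split might look like the following. The names `StudentMarks`, `StudentNames`, `Students`, and `mean_average` are assumptions chosen for illustration, not an established API, and whether the loop is actually vectorized depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One table per processing need. Row i of each vector in a table
// refers to the same student.
struct StudentMarks {
    std::vector<int> id;
    std::vector<float> average;

    void add(int student_id, float average_mark) {
        id.push_back(student_id);
        average.push_back(average_mark);
    }
};

// Rarely used data lives in its own table, keyed by the same id.
struct StudentNames {
    std::vector<int> id;
    std::vector<std::string> name;
};

// Graduation status is encoded by which table a row sits in,
// so no bool has to be stored or tested per student.
struct Students {
    StudentMarks graduated;
    StudentMarks not_graduated;
    StudentNames names;
};

// Mean of the average marks in one table; returns 0 for an empty table.
// The loop reads one contiguous float array with no branch per student.
float mean_average(const StudentMarks& table) {
    assert(table.id.size() == table.average.size());
    if (table.average.empty()) return 0.0f;
    float sum = 0.0f;
    for (float mark : table.average) sum += mark;
    return sum / static_cast<float>(table.average.size());
}
```

Calling `mean_average(students.not_graduated)` then answers the question from the example without ever touching names or graduated students. The trade-off is that when a student graduates, their row has to be moved from one table to the other.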

This sounds nice in theory, but I have never done this kind of development on a real-world project. If anybody has any experience, please contact me; I have many questions.