Mock Early

In big data, one thing you need to adjust to is how long even simple operations take on a large dataset. It's easy to be over-optimistic and test every change against the whole dataset. The problem, of course, is that if the change is wrong, you have to repeat the entire test iteration. Each run might take only a few minutes, but a few minutes per iteration easily adds up to an hour.

As a rule, avoid running test iterations over the complete data when a run takes more than a minute and the change is not trivial. Instead, sample or mock the data, preferably in a reusable way. Err on the side of doing this even when it seems unnecessary, because it's easy to get lazy or to underestimate how much time it will save.
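
One way to make sampling reusable is to draw a fixed-size sample with a seeded random generator, so every test iteration sees the same small subset. The sketch below is only an illustration, not a prescribed method: the function name reservoir_sample, the sample size, the seed, and the range() stand-in for a real dataset are all assumptions. It uses reservoir sampling, which needs a single pass and never holds the full dataset in memory.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Pick up to k records uniformly at random in a single pass.

    Hypothetical helper: a fixed seed makes the sample repeatable, so
    every test iteration runs against the same subset.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = record
    return sample


# Stand-in for a large dataset; in practice this could be any iterable,
# such as lines streamed from a file.
full_dataset = range(100_000)
dev_sample = reservoir_sample(full_dataset, k=1_000, seed=42)
```

Saving dev_sample to a small file once and pointing test runs at that file is another option, and it avoids even the single pass over the full data on later iterations.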

The worst-case scenario is that you keep adding complexity without any kind of sampling or mocking. As the project grows, so does the cost of setting up sampling or mocking, and it becomes difficult to make any change quickly.