Being able to experiment with big data and queries in a safe and secure “sandbox” test environment is important to both IT and business users as companies get started with big data. However, setting up a big data sandbox test environment is different from establishing traditional test environments for transactional data and reports. Here are ten key strategies to keep in mind for building and managing big data sandboxes:
1. Data mart or master data repository?
1. Data mart or master data repository?
The database administrator needs to decide early on whether test sandboxes should use data directly from the master data repository that production uses, or whether to replicate sections of that data into separate data marts reserved for testing only. The advantage of using the full repository is that tests run against the same data production uses, so test results will be more accurate. The disadvantage is that testing can create contention with production itself. With the data mart strategy you don’t risk contention with production data, but the marts will likely need to be refreshed periodically to stay reasonably synchronized with production if they are to closely approximate the production environment.
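To make the trade-off concrete, here is a minimal sketch of a periodic data-mart refresh, written in Python against the standard sqlite3 module. The database paths, the “orders” table, and the 10 percent sample are hypothetical placeholders rather than a prescription for any particular platform.

    # Minimal sketch of a periodic data-mart refresh. Paths, the table
    # name, and the sample fraction are hypothetical placeholders.
    import sqlite3

    def refresh_mart(prod_path="prod.db", mart_path="sandbox_mart.db",
                     table="orders", sample_fraction=0.10):
        """Copy a fresh slice of production data into the sandbox mart."""
        prod = sqlite3.connect(prod_path)
        mart = sqlite3.connect(mart_path)
        try:
            # Rebuild the mart table from scratch so stale test data is dropped.
            mart.execute(f"DROP TABLE IF EXISTS {table}")
            schema = prod.execute(
                "SELECT sql FROM sqlite_master WHERE type='table' AND name=?",
                (table,)).fetchone()[0]
            mart.execute(schema)

            # Pull a random sample rather than the full table to keep the
            # mart small and limit contention with production.
            rows = prod.execute(
                f"SELECT * FROM {table} WHERE abs(random()) % 100 < ?",
                (int(sample_fraction * 100),)).fetchall()
            if rows:
                placeholders = ",".join("?" * len(rows[0]))
                mart.executemany(
                    f"INSERT INTO {table} VALUES ({placeholders})", rows)
            mart.commit()
        finally:
            prod.close()
            mart.close()

Sampling a slice rather than copying the full table keeps the mart small, though the mart will drift out of sync with production until the next scheduled refresh.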
2. Work out scheduling
Scheduling is one of the most important big data sandbox activities: it ensures that all sandbox work runs as efficiently as possible, typically by scheduling a group of smaller jobs to run concurrently while a longer job is in progress so that resources are allocated to as many jobs as possible. The key to this process is for IT to sit down with the user areas that share the sandboxes so everyone has an upfront understanding of the schedule, the rationale behind it, and when they can expect their jobs to run.
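As a rough illustration of packing smaller jobs around a long-running one, here is a toy sketch using Python’s standard concurrent.futures module. The job names and durations are invented; in practice the cluster’s own scheduler (YARN, Kubernetes, and so on) makes these placement decisions.

    # Toy illustration of running small jobs concurrently with a long one.
    # Job names and durations are invented for illustration only.
    import time
    from concurrent.futures import ThreadPoolExecutor

    def run_job(name, seconds):
        print(f"starting {name}")
        time.sleep(seconds)          # stand-in for real sandbox work
        print(f"finished {name}")
        return name

    long_job = ("nightly_model_training", 8)
    small_jobs = [("adhoc_query_1", 2), ("adhoc_query_2", 3), ("report_refresh", 2)]

    # The long job takes one worker slot; the remaining slots are filled
    # with smaller jobs so capacity is not idle while it runs.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(run_job, *long_job)]
        futures += [pool.submit(run_job, *job) for job in small_jobs]
        for future in futures:
            future.result()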
3. Set limits
If months go by without a specific data mart or sandbox being used, business users and IT should have mutually acceptable policies in place for purging those resources so they can be returned to a pool and re-provisioned for other activities. The test environment should be managed as carefully as its production counterpart, with resources called into play only when they are actively being used.
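One way to enforce such a policy is a periodic sweep that flags sandboxes showing no recent activity. The sketch below assumes each sandbox lives in its own directory under a hypothetical /data/sandboxes root and uses file modification times as a rough stand-in for real usage history.

    # Sketch of an idle-sandbox sweep. The root directory and 90-day
    # window are assumptions; real policies would also consult job history.
    import time
    from pathlib import Path

    IDLE_DAYS = 90                      # mutually agreed retention window
    ROOT = Path("/data/sandboxes")      # hypothetical sandbox root

    def idle_sandboxes(root=ROOT, idle_days=IDLE_DAYS):
        cutoff = time.time() - idle_days * 86400
        for sandbox in root.iterdir():
            if not sandbox.is_dir():
                continue
            # Newest modification time anywhere inside the sandbox.
            last_used = max((p.stat().st_mtime for p in sandbox.rglob("*")),
                            default=sandbox.stat().st_mtime)
            if last_used < cutoff:
                yield sandbox, last_used

    for path, ts in idle_sandboxes():
        print(f"candidate for purge: {path} (last used {time.ctime(ts)})")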
4. Use clean data
One of the preliminary jobs in a big data pipeline should be preparing and cleaning data so that it is of reasonable quality for testing, especially if you are using the “data mart” approach. It is a bad habit, dating back to testing for standard reports and transactions, to fill test regions with data that is incomplete, inaccurate, or even broken simply because it was never cleaned up before being dumped there. Resist this temptation with big data.
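A cleaning step can be as simple as de-duplicating, dropping rows that are missing keys, and coercing obviously bad values before the data lands in the mart. The pandas sketch below illustrates the idea; the file and column names are hypothetical.

    # Hypothetical cleaning step for data feeding a test mart.
    import pandas as pd

    def clean(raw_path="raw_orders.csv", out_path="clean_orders.csv"):
        df = pd.read_csv(raw_path)

        df = df.drop_duplicates()                 # remove exact duplicates
        df = df.dropna(subset=["order_id"])       # rows missing a key are unusable
        df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # bad values become NaN
        df = df[df["amount"] >= 0]                # discard obviously broken records

        df.to_csv(out_path, index=False)
        return df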
5. Monitor resources
Assuming big data resources are centralized in the data center, IT should set resource allowances and monitor sandbox utilization. One area that often requires close attention is the tendency to over-provision resources as more end-user departments engage in sandbox activities.
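A lightweight utilization check against the agreed allowances might look like the sketch below, which sums storage used per department directory and compares it to a quota table. The quotas and directory layout are assumptions; a real deployment would pull usage figures from the cluster manager instead.

    # Sketch of a storage-allowance check. Quotas and paths are assumptions.
    from pathlib import Path

    QUOTAS_GB = {"marketing": 500, "finance": 250, "ops": 250}
    ROOT = Path("/data/sandboxes")

    def usage_gb(path):
        return sum(p.stat().st_size for p in path.rglob("*") if p.is_file()) / 1e9

    for dept, quota in QUOTAS_GB.items():
        sandbox = ROOT / dept
        used = usage_gb(sandbox) if sandbox.exists() else 0.0
        status = "OVER ALLOWANCE" if used > quota else "ok"
        print(f"{dept}: {used:.1f} GB of {quota} GB ({status})")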
6. Watch for project overlap
At some point it makes sense to have a corporate “steering committee” for big data that tracks the various sandbox projects under way throughout the company and ensures there is no overlap or duplicated effort.
7. Consider centralizing compute resources and management in IT
Some companies start out with big data projects in specific departments, but those departments quickly learn that they can’t work on big data, do their daily work, and manage compute resources too. Ultimately, the equipment moves into the data center for IT to manage, which frees the departments to focus on the business and on the ways big data can deliver value.
8. Use a data team
Even in sandbox experimentation, it’s important to have a team with the requisite big data skills on hand to assist. Typically, this team consists of a business analyst, a data scientist, and an IT support person who can fine-tune hardware and software resources and coordinate with database specialists.
9. Stay on task with business cases
It’s important to bring creativity to sandbox activities, but not to the point where you lose sight of the business case you originally set out to deliver value against.
10. Define what a sandbox is!
Participants coming from the business side, in particular, may not be familiar with the term “sandbox” or what it implies. Like the childhood sandbox, the purpose of a big data sandbox is to play and experiment freely with big data, but to do so with purpose. Part of that purposeful activity is abiding by the ground rules of the sandbox, such as when, where, and how to use it, and experimenting in ways that produce meaningful results for the business.