
Unstructured Data and the Cloud

Everyone is embracing the cloud, from enterprises to vendors to analysts. Clouds, whether public or private, are designed to furnish on-demand, elastic compute, infrastructure and storage resources as needs rise and fall. One particular aspect of the public cloud that is gaining momentum is data storage, or Storage Infrastructure as a Service (SIaaS). This is being driven by the ratio of ancillary storage required to support 1 terabyte of data, currently about 9:1. Gartner has predicted that data will grow at a rate of 800% over the next few years, with 80% of that growth attributed to unstructured data, and Forrester has stated that 70% to 80% of existing enterprise data is unstructured. When this 9:1 ratio is combined with the predicted growth of data, especially unstructured data, the storage investment required is staggering.

While the benefits of a SIaaS model are glaringly obvious, I’ve been mulling over a central issue for some time. Do enterprises understand the composition of the unstructured data they would move into public clouds through this model, and the implications of not knowing? The crux of the issue is the lack of visibility into how much unstructured data currently resides within their environments, its relevance (age and redundancy), its types, and who owns or accesses it. Without this information, enterprises are flying blind as to where to position their data to ensure business operations are not impacted.

One obvious option is for enterprises to simply move all their unstructured data into a public cloud infrastructure through a SIaaS provider and live with the lack of visibility. While this may provide a quick fix, does it really mesh with their overall management strategy for unstructured data?

Before making a move to a public cloud storage model, enterprises have to be able to answer some key questions concerning their unstructured data.

  • What is the sum of all unstructured data?
  • Where does the existing unstructured data reside?
  • What percentage of the unstructured data is redundant?
  • What file systems are in use (CIFS, NFS, NTFS)?
  • How are file systems structured?
  • What are the types of unstructured data?
  • Who owns the unstructured data?
  • Who has access (groups, shares, individuals)?
  • How recently has the data been accessed (hour, day, week, month)?

The simplest way to answer the above questions is for enterprises to take an inventory of all their unstructured data, no matter where it sits. Once the inventory is collected and a baseline of knowledge is established, enterprises can use it to make informed decisions about which data is migrated, meet critical business requirements, ensure consistency with data management strategies and comply with internal and regulatory requirements for data retention. In the next posting I will provide some best practices for inventorying unstructured data.
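To make the idea of an inventory concrete, here is a minimal sketch in Python that walks one folder tree and answers a few of the questions above: total size, redundancy, file types and how recently files were accessed. The function names and the age buckets are my own assumptions for illustration, not part of any product. Ownership and access rights are left out, last-access times are only as reliable as the file system that records them, and hashing every file would be slow on terabytes of data.

```python
import hashlib
import os
import time
from collections import Counter


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def age_bucket(days):
    """Map days since last access onto coarse buckets (an assumed scheme)."""
    if days < 1:
        return "day"
    if days < 7:
        return "week"
    if days < 30:
        return "month"
    return "older"


def inventory(root):
    """Summarize size, redundancy, types and access age for files under root."""
    now = time.time()
    total_bytes = 0
    redundant_bytes = 0          # bytes held by second and later identical copies
    bytes_by_type = Counter()    # bytes per file extension
    files_by_age = Counter()     # file count per last-access bucket
    seen_digests = set()

    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                info = os.stat(path)       # stat first, before reading the file
                digest = file_digest(path)
            except OSError:
                continue                   # unreadable file: skipped in this sketch
            total_bytes += info.st_size
            extension = os.path.splitext(name)[1].lower() or "(none)"
            bytes_by_type[extension] += info.st_size
            files_by_age[age_bucket((now - info.st_atime) / 86400)] += 1
            if digest in seen_digests:
                redundant_bytes += info.st_size
            else:
                seen_digests.add(digest)

    return {
        "total_bytes": total_bytes,
        "redundant_bytes": redundant_bytes,
        "bytes_by_type": dict(bytes_by_type),
        "files_by_age": dict(files_by_age),
    }
```

Calling `inventory` on each share and adding up the results gives a rough baseline. The percentage of redundant data is then `redundant_bytes / total_bytes` for the folders scanned.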


Is central data storage a good idea?

Current trends in the IT storage arena focus on businesses moving to a centralized filer with an off-site Disaster Recovery filer as the file storage model. Is this the best approach? Reminds me of mainframes, remember? It certainly makes sense for IT to have only one server to maintain. Sounds easier, doesn’t it? But is it the best fit for the actual business content?

By their very nature, businesses are silos with employees who have specialized roles. Each silo has some data that is shared, while the bulk, perhaps 80%, is used only by the department that created it. Each department has its own applications as the tools for its business function. Wouldn’t it make sense from a business content perspective to have department-based servers?

Now picture an architecture based on VMs (virtual machines on platforms like Hyper-V and VMware) running on those servers as the shared resource per department. A server sized for a department may be priced at nearly a commodity level. Bigger departments pay more and get more. Having a spare VM server then becomes an easy decision. A spare server could provide a crunch-time power boost as well as fail-over and maintenance capability, which is possible because VMs can be migrated between physical servers.

Advantages of a distributed model

1) When a physical server fails it doesn’t take down the whole company. In fact, that department may move their VM to the spare server.
2) Finding information on a department server may be faster than on a central server because you are physically bound to looking in a smaller space.
3) The network load is distributed and not bottle-necked to one server.
4) Data grows at different rates and at different times for each business group. Adding storage directly for those whose budget pays for it is easy in a distributed model.
5) It is easier to back up separate servers that are smaller.
6) Centralized servers can become bottlenecks as the business evolves over time.
7) Many businesses operate from geographically distributed offices. Having a distributed model in the headquarters maps easily to remote offices.
8) Some parts of a business may map readily to being run from the cloud. A distributed model enables that transition to happen without impacting the rest of the operation.

Don’t be fooled by email spam, include these 3 steps:

Email spam continues to evolve in sophistication. Even with spam filters on, some messages get through. Is the email something that may help my business, or will it cause havoc with a virus? I received one from a “World Registry” today. How do I know if I should delete it? Below are the 3 steps that I take for processing questionable email.

Step 1: Set your email client to block remote content. Thunderbird prompts me to “Show Remote Content” if I want to see more than plain text. Other clients like Outlook provide a “download pictures” spot to click. This has two positive effects: it displays the email faster because it’s not loading graphics, and it blocks active content that could be malicious. In the case of the suspect email, I go to step two.

Step 2: Email clients typically provide a way to see the “source” or “header” of the email. It takes only a couple of seconds for a questionable email and beats the alternative of spending hours recovering a computer from a virus. Sometimes it’s under a “Tools” pull-down; in Outlook it’s under send/receive, download headers; in Thunderbird there is an “Other Actions” pull-down on the right side that has a “View Source” selection. When the header pops up, you’ll see a lot of gibberish displayed. Look for the keywords “Return-Path”, “Received: from”, “To:”, “From:” and “Sender:”. For each keyword found, look at the names supplied for the email address. Some will be coded because they are intermediate hops, but there should be at least one “from” name that makes sense. If none of them look like normal names, that is a pretty good sign that the sender is hiding their identity. And if the email appears to be from a company, is the sender’s real address from that company? In my case I saw that the sender was a generic account and that the message was sent to an email account of mine used to catch wrong email addresses. In other words, it was not a person-to-person email. As a final check, go to step three.
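If you save the message source to a file, the same header check can be sketched with Python's standard `email` module. The function names below are my own, and the domain comparison is only a hint under the assumption that a mismatch deserves a closer look. Plenty of legitimate bulk mailers use a different Return-Path domain, so treat the result as a prompt to investigate, not a verdict.

```python
from email import message_from_string
from email.utils import parseaddr


def sender_summary(raw_source):
    """Pull out the header fields worth eyeballing in a questionable email."""
    message = message_from_string(raw_source)
    summary = {}
    for field in ("Return-Path", "From", "Sender", "To"):
        value = message.get(field)
        if value is not None:
            # parseaddr returns (display name, address); keep the address only.
            summary[field] = parseaddr(value)[1]
    # There is usually one Received header per hop, newest first.
    summary["Received"] = message.get_all("Received", [])
    return summary


def domains_disagree(summary):
    """True when From and Return-Path are both present but use different domains."""
    visible = summary.get("From", "")
    bounce = summary.get("Return-Path", "")
    if "@" not in visible or "@" not in bounce:
        return False
    return visible.rsplit("@", 1)[1].lower() != bounce.rsplit("@", 1)[1].lower()
```

Paste the text from “View Source” into `sender_summary` and compare the addresses it returns with who the email claims to be from.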

Step 3: Google or Bing it. Use the title of the email. Many times someone has received the email before you. In my case, someone had received the same email and provided the clue that it was a scam to lock any recipient who responds into a contract for 3 years paying 990 euros a year. As enticing as that email offer was, my response was to delete the email.

In summary:
1) Block remote content until you know it’s worth viewing
2) Check out who really sent the email.
3) See if anyone else saw the same email.

How long does it take to copy, move or migrate terabytes of data?

The first question to consider when contemplating a filer data migration is how long will this take? It depends. What problem are you trying to solve? Let’s say that the task is migrating all the file data from an old server to a new filer. In general that process involves:

1) Set up a new server/filer ready to be loaded.
2) Size the data.
3) Decide if some data is to be left behind and later removed.
4) Set up admin credentials for the copy job.
5) Select a time window to do the copying.
6) Do the bulk copy during the least busy window.
7) Review the files and identify those with issues as shown in the log file.
8) Correct issues and begin a series of incremental copies.
9) Warn users of the impending cutover.
10) Do the cutover copy.
11) Switch the users to the target server.
12) Confirm files are correct for users on the target.

Other copy tasks will take more or less time depending on, for example, the file filtering, file permissions and volume configuration. Those combinations are too numerous to outline here.
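To picture what the incremental copies in step 8 do, here is a minimal sketch in Python. It is a generic illustration and not how any particular copy product works: it assumes a file needs copying when it is missing on the target or differs in size or modification time, and it does not carry over ACLs or ownership. The function name `update_copy` is my own.

```python
import os
import shutil


def update_copy(source, target):
    """Copy files that are missing on target or look changed; return (copied, failed)."""
    copied, failed = [], []
    for folder, _dirs, files in os.walk(source):
        relative = os.path.relpath(folder, source)
        target_folder = os.path.normpath(os.path.join(target, relative))
        os.makedirs(target_folder, exist_ok=True)
        for name in files:
            src = os.path.join(folder, name)
            dst = os.path.join(target_folder, name)
            try:
                src_info = os.stat(src)
                if os.path.exists(dst):
                    dst_info = os.stat(dst)
                    # Whole-second comparison tolerates coarser target timestamps.
                    unchanged = (
                        dst_info.st_size == src_info.st_size
                        and int(dst_info.st_mtime) >= int(src_info.st_mtime)
                    )
                    if unchanged:
                        continue
                shutil.copy2(src, dst)  # copy2 also carries over the timestamps
                copied.append(src)
            except OSError as error:
                # Keep going and report the problem, much like a copy log would.
                failed.append((src, str(error)))
    return copied, failed
```

The first run behaves like a bulk copy. Later runs only touch what changed, which is why they finish so much faster, and the `failed` list plays the role of the log file you review in step 7.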

How long will it take to copy files?

I recommend the following process to size the job. Start by estimating the time to do the job using the back of the envelope figures below. Then run a copy program like FilePilot Copy in “Dry Run” mode. This will provide a close estimate of the time to copy while only hitting the source filer. A “Dry Run” also provides an early window into the issues in your environment that may prevent files from being copied. Address the issues found and plan when to do the bulk copy, now that you know roughly how long it will take.

Assumptions for the back of the envelope estimate.

Hardware factors are network, servers and disk performance. File content factors are file size, type, attributes, and ACLs laid out in a pattern of folders. Both of these groups of factors affect the performance of the copy job.

For this back of the envelope calculation, the hardware factors are two Windows servers running Windows 2008, with SATA 7200 RPM drives and a 1 Gigabit network. For the file content factors, we use a copy of the C drive OS and program files, which gives a mix of files from 0 bytes to megabytes with various file attributes and varying folder depths. In this model, approximately 5-10% of files will fail to be copied, depending on the copy tool used. This helps with the accuracy of the copy performance numbers cited below. This configuration has produced peak copy rates in the 70-90 MiBps range.

The software copy tool is set up in the push model, with the copy product on the source server pushing to the target server. The software copy tool used for this data migration test was FilePilot Software Copy.

  • Average rate 25 MiBps
  • Peak rate 35 MiBps
  • Time to copy (or Dry Run) 1TB is 12 hours
  • Time to update copy 1TB is 400 seconds or 6.7 minutes

For Y TB, the bulk copy takes about 12 * Y hours. Because the bulk copy stresses the servers and network, you will want it completed off-hours if possible. The update copy is your daily copy up to the day of the cutover. The cutover is also performed with the update copy. The update copy needs to be completed as quickly as possible, as the source server or filer will need to be off-line to users during the copy.
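The 12 hours per terabyte figure follows from the average rate. Here is the arithmetic as a small Python sketch. It assumes binary units (1 TB taken as 1024 * 1024 MiB) to match the MiBps rates above, and the function name is my own.

```python
MIB_PER_TB = 1024 * 1024  # assumes binary units, to match rates quoted in MiBps


def bulk_copy_hours(size_tb, rate_mib_per_s=25.0):
    """Estimate bulk copy time in hours from data size and average copy rate."""
    if rate_mib_per_s <= 0:
        raise ValueError("copy rate must be positive")
    seconds = size_tb * MIB_PER_TB / rate_mib_per_s
    return seconds / 3600
```

At the 25 MiBps average rate, `bulk_copy_hours(1)` comes out a little under 12 hours, which is where the rounded 12 * Y rule of thumb comes from. Plug in the average rate you measure in your own Dry Run to get a figure for your environment.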

What do you do if the number of terabytes you need to move in the bulk copy is going to run beyond your copy window into business hours? You have some choices. With copy products like FilePilot Software you may run as many copies of the program as you need. A 4 TB copy with 4 copies will take about the same time as a 1 TB copy. The assumption is that the network bandwidth hasn’t been fully tapped and that multiple volumes exist on both servers. The second option is to start the copy on the weekend and, as it runs into the work week, turn down the copy rate. In FilePilot Copy this is done in the GUI by dragging a slide bar to dynamically control the copy rate. At night, slide it up again. After the bulk copy has completed, it’s time to run the update copy.
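The first option can be sized with a one-line calculation. This Python sketch assumes, as stated above, that parallel copies scale linearly because the network and disks are not yet saturated. In practice that assumption breaks down once the link fills up, so treat the answer as a lower bound. The function name is hypothetical.

```python
import math


def parallel_copies_needed(size_tb, window_hours, hours_per_tb=12.0):
    """Smallest number of simultaneous copy jobs that fits the data in the window."""
    if window_hours <= 0:
        raise ValueError("copy window must be positive")
    single_job_hours = size_tb * hours_per_tb
    return max(1, math.ceil(single_job_hours / window_hours))
```

For example, 4 TB in a 12 hour overnight window needs 4 copies, while 10 TB spread over a 60 hour weekend needs only 2.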

This overview of sizing a data migration should provide a good starting point for your next project.

Compliance and access to data in a multi-office business

By Richard B. Knowles

The cloud has opened the door on requirements for the next level of compliance to privacy protection. In looking around at the information available, one of the best sites is the site. They have an excellent collection of documents available for download. The usefulness of these documents is not limited to cloud use-cases. I started with the last document, ‘Target Data Tracker’. This is the classic eye-opener, spreadsheet questionnaire. I started thinking of the situations that may be detected just by filling out the questionnaire. Consider that during day-to-day business activities that are carried out by multiple people, it’s easy to overlook subtle details. For example, did any client data land in a “temporary” location? Was it properly removed afterwards? Because of security blocks in other parts of the infrastructure, direct movement of information may not be possible. This leads to leaks in data protection. The classic case of this is the boss who arrives from headquarters at a remote office. The boss, as an ‘outsider’, is unable to connect to the net, email and other data sources in the company while at the remote office. This leads to the IT staff punching a hole in the firewall for the boss. Two problems follow. First, if a scanner from outside catches the opportunity, an intruder may get in. Second, the hole in security may remain open after the boss leaves because IT is busy with other fires, which can be a much larger window of security vulnerability.
An alternative approach is to build in a process and tools to consistently provide the traveling team their critical data no matter where they land. FilePilot Copy enables this type of solution. Simply set up a share at each location with the security permissions locked down for use by the traveling party. Adjust the firewall for the FilePilot Software specific protocols and the IP addresses of the systems involved. Run FilePilot Copy to copy files in a ‘sync’ mode that keeps both shares identical. Because checking for updates is very efficient, the program may be run on a schedule to ensure that the shares remain close, even during frequent updates by the file owners.
This avoids exposing critical corporate file data to unauthorized parties while enabling efficient working conditions. As for email and other external access, set up a DMZ and allow the travelers to connect to the network in that zone. Their portable devices should already have a public network configuration that would apply.

Sharing group documents in one or more locations

Leveraging shares for group projects

By Richard B. Knowles

How do you currently share documents between team members who are working on a common project? There are many approaches to finding the right solution for your situation and it basically comes down to time, effort and money. First, how are the documents stored? Shared data may take one of two forms, data in a database or data in files. I’ll focus on the file data approach. The advantage of files for data storage is that it is highly scalable, easily stored at multiple sites and the data is separable for reuse in other projects. The downside is that the organization of data is not enforced by data structure itself and is left to the team to implement.

Let’s look at setting up a file data based project. How does one choose a document management solution for projects?

There are many factors involved in selecting what is the best approach for your organization.

  1. What problem are you trying to solve?
  2. What is the time frame to have a solution in place?
  3. How many people are sharing the documents?
  4. How long will people hold the document before passing on to others?
  5. What are the rules for modifying and commenting in the documents?
  6. Is the project in one location or multiple locations?

The approach presented in this article is a simple one that can be implemented quickly and works best for small project teams of fewer than a dozen people. It may work for larger teams with well-structured rules for document workflow or in cases where there are logical sub-groups of documents. The solution consists of two parts: first, the creation of a process that the team agrees upon, and second, the creation of a “share” on a central server. This approach should take no more than a few hours to have fully up and operational.

Factors that are driving this approach:

  1. Create a simple, flexible environment for ad hoc project document sharing.
  2. Need it now.
  3. Have a small, agile team.
  4. Fluid time of ownership for holding documents.
  5. Fast email or IM of links to pass the document baton. (as one mechanism option for workflow)
  6. No new tools need to be learned by the team and therefore it’s instant-on.
  7. Will work for one or more locations.

Set up your process

Set up ground rules and have them available to the team. Make sure everyone understands the process. Some considerations for the process are:

  1. How is the baton passed from writer to writer?
  2. Who says when a document is done? Is there an individual or committee? Or is it voting?
  3. What are the types of documents allowed in the project?
  4. Who may add new documents to the project?
  5. Who may delete documents?
  6. Are completed documents moved to a completion folder?
  7. Is there a folder hierarchy that maps to the type or context of the project?
  8. Who owns backup and versioning of the documents?
  9. Do the documents need to be mirrored or synchronized on one or more locations?

Once the workflow has been determined, it is time for the IT part of the setup.

Set up storage for containing the documents:

[If there is more than one location, then simply replicate the model as described below for a single location.]

You will need a computer for storage of the project data. This can be as simple as creating a share on one of the team members’ computers or a share on a central server. But first, what is a “share”? A “share” is a handle for accessing a specific folder on a specific server. Microsoft Windows operating systems allow one or more people to access a computer and specific folders across a network using shares. This works equally well for NAS storage or computers running Windows OS. Setting up a share begins with right-clicking the folder to be shared, selecting “share with” and choosing “specific people”. Once the share has been set up, users can gain access to it through network links or shortcuts, or one can even map a disk drive on the user’s computer to access the share.

IT Management of the Project

Set up the initial location for document storage:

  1. Create the project share and sub-folders.
  2. Set up the ACLs and sharing users.
  3. Set up the links, shortcuts and drive mappings as required.

Multi-site project coordination and setup:

Using a tool like FilePilot Copy, you really only need to set up and populate one site. To create the additional sites, run FilePilot Copy as described here.

  1. First run FilePilot Copy using simple copy mode with the create share configuration option.
  2. Set up a FilePilot Copy configuration using sync copy mode and run it when the team needs to synchronize the sites. This mode will check files on both sites and simultaneously copy files in both directions. If there are any uncertainties, the files are left alone and the condition is noted in the copy log file.
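For readers who want to picture what a two-way sync pass involves, here is a minimal Python sketch. It is a generic illustration and makes no claim about how FilePilot Copy works internally. My assumptions: a file present on only one site is copied to the other, the newer of two differing files wins, and files with the same timestamp but different sizes count as an “uncertainty” that is logged and left alone. Without a record of the previous sync, a sketch like this cannot tell a deleted file from a new one. All function names are hypothetical.

```python
import os
import shutil


def list_files(root):
    """Map relative path -> (size, whole-second mtime) for every file under root."""
    found = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            info = os.stat(full)
            found[os.path.relpath(full, root)] = (info.st_size, int(info.st_mtime))
    return found


def copy_file(src_root, dst_root, relative):
    """Copy one file between sites, creating target folders as needed."""
    dst = os.path.join(dst_root, relative)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(os.path.join(src_root, relative), dst)  # keeps the timestamps


def sync(site_a, site_b):
    """Bring two folder trees in line; return a log of (action, relative path)."""
    files_a, files_b = list_files(site_a), list_files(site_b)
    log = []
    for relative in sorted(set(files_a) | set(files_b)):
        if relative not in files_b:
            copy_file(site_a, site_b, relative)
            log.append(("a->b", relative))
        elif relative not in files_a:
            copy_file(site_b, site_a, relative)
            log.append(("b->a", relative))
        elif files_a[relative] == files_b[relative]:
            continue  # same size and timestamp: treated as identical
        elif files_a[relative][1] > files_b[relative][1]:
            copy_file(site_a, site_b, relative)
            log.append(("a->b", relative))
        elif files_b[relative][1] > files_a[relative][1]:
            copy_file(site_b, site_a, relative)
            log.append(("b->a", relative))
        else:
            # Same timestamp but different size: leave both copies alone.
            log.append(("conflict", relative))
    return log
```

Running `sync` on a schedule keeps the two shares close. Anything logged as a conflict is left for a person to resolve.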

Project Backup:

  1. Active projects create very dynamic data files. Therefore the useful life span is short.
  2. The best option is to allocate space on an additional server for a secure copy of the project.
    a. Fast setup using the FilePilot Copy with simple copy mode to establish the “full backup”.
    b. Updates can happen as frequently as every few minutes, nightly or weekly.
    c. This solution doesn’t add to the backup tape load with the project’s intermediate data.
    d. Easy and fast to “restore” from the backup disk any damaged files.
    e. When milestones are reached, the project can be copied using FilePilot Copy to a staging location on the backup server for backup to tape.

Advanced Backup:

A remote location may be used to provide backup or disaster recovery of the project. Simply follow the “multi-site” instructions.

Summary of Benefits:

  1. Can be implemented in a few hours.
  2. Pointers to the documents through links or shortcuts can be quickly set up and reduce the load on the email system, as links to the documents can be passed in short emails instead of the whole document.
  3. Links or names of the documents can also be passed around by IM.
  4. Reduced network load from not emailing documents on every change to the documents
  5. The project can be backed up as a group making recovery easier
  6. The project can be moved to lower cost storage when the project is no longer in development by tools like FilePilot Software’s FilePilot Copy tool. For more information see
  7. No training on new tools is needed.

5 Tips for Buying IT Software

By Rick Knowles
These are tips based on what goes on behind the scenes at software companies.
  1. Select it because it does one thing really well.
    • Real value is found when the product addresses your problem directly.
    • Being forced to buy a “suite” when one feature is all you need is a waste of money and time. The suite product is typically harder to license, harder to install and harder to use. After all, if you like how the single item product works wouldn’t you buy other products from them?
    • There will almost certainly be a ‘deeper’ set of features and functionality around a single purpose product.
  2. Select it because you can use it for multiple purposes or tasks.
    • Many times a well-focused product is recognized as useful for multiple tasks.
    • For example, a file copy program can be used for copying files or aggregating files for backup.
    • If the key purpose of the software is strong enough, it can be reused for other tasks.
    • How easy is it to use for other tasks? Are you allowed multiple copies at your site? How hard is it to configure the software for other tasks?
  3. The company producing the software invests in it. How can you tell?
    • How frequently does the company update their products?
    • Are the new items just bug fixes or new features?
    • Are the website documents and information up-to-date?
  4. Learning to use it won’t lead to job security.
    • Lazy programming results in users having to do busy work to make up the short-comings in the software.
    • Requiring users to perform a sequence of steps known to be needed for a specific task, when the software could do it for you is a sign that the software will continue to be a time sink.
    • Undocumented features, tricky combinations of switches, and other “features” that require IT’s time and effort to figure out are not adding value even if the product is lower cost. Instead, pick the competitive product that costs more but saves time. Time is the one resource IT has too little of and it is never renewable.
    • Knowing a product doesn’t help your resume, because recruiters want to see what you accomplished with the product. And by the way, so does your boss.
  5. The people that you interact with at the company supplying the product are enthusiastic about it.
    • When you talk to someone at the company do they merely try to bad mouth the competition, or are they genuinely convinced they have a better solution?
    • Does the person sound honest and straightforward? For example, can they admit the competition has its pluses, but also give you some concrete examples of why they are better?
    • Does support get you the answers right away?
    • Is the knowledge base or FAQ up-to-date?

Rick Knowles is co-founder and VP of Technology for FilePilot Software, which is re-inventing file copying technology and experience for Windows and NAS storage.
