Dealing with Data
September 7, 2009  |  Data

I have been a fan of efficient and safe data analysis so I thought that I would take the time to discuss the approach that I take when analyzing large amounts of corporate data. When it comes to data analysis, bulk-data can be generated yearly, monthly, weekly, daily, hourly and so forth, but what is best? It really comes down to the time-sensitivity of the data and how it will be used. If you are looking to plan for major roll-outs, yearly or even monthly would be sufficient, but if you are dealing with customer complaints that need to be addressed quickly, daily and hourly are much more useful.

It is also critical to understand the quantity of data and the room for error. Unless automated, it is very difficult to analyze large data sets (one million+ records) on a daily basis without losing accuracy. Differently, yearly or monthly analysis can be largely error-free even when dealing with massive data sets as time pressures are generally minimized and more care can be taken.

From here on, let’s assume you are using a large set of data (one million+ records) on a bi-annual basis – time is not a huge factor, but accurate information is crucial. Let’s also assume that the data is coming in across many different text files (CSV, tab-delimited, etc). These text files have consistent headings and are only separated by geographic boundaries. Additionally, I will assume that storage space for back-ups is largely unlimited – AS IT SHOULD BE!

This may not be the optimal approach, but this is what I have done in the past.

Step 1. Make duplicates of ALL received data in consistent back-up locations. Organize in folders. For instance, “2010 Project Awesome” with a sub-folder for “Raw Text”.
Step 2. Import all data files into separate tables that are easily joined in your favourite database program. When done, LABEL all tables and back-up to “Raw Text” folder with a heading. I hope you are using something newer than MS Office 2003 so that the 2GB limit is not an issue. :)
Step 3. Join all of the tables to make one big massive table. This may or may not be necessary depending on the speed of your join queries.
Step 4. Keep the database with step 2 and 3 backed-up, but now it is time to trim some of your data. This is where normalizing the database is critical. You most likely have far too many columns so let’s remove those that have no relevance to the analysis (remember they are backed-up in the previous step if you really need them back). Also, if one column has a 10-digit phone number and the next 3 all have it split up in the 3,3,4 fashion, remove those 3 columns – this will drastically speed up your query speed, launch speed and just make your database easier to deal with. Also, change the labels of the columns to something more useful and memorable. Back-up!

Step 5. Clean and correct bad data. Do a brief check to ensure that all records have useful information. You most likely have some empty records and others with too much information missing so just remove those. Do NOT remove records that could be potentially useful as the pivoting of data will deal with most of this. Back-up!
Step 6. Alright, so now you should have largely clean data that is at least reasonably optimized. This is where I really hope you know your pivot tables as that is the best way to bring aggregate data into a spreadsheet such as Excel. If not, read up on database pivoting as it is a GREAT skill to have.
Step 7. Analyze in Excel as needed! If you need to grab some customer/record specific data go back to one of your back-ups and check it out. SQL is a must-know and comfort with the data and programs is also crucial.

I am sure that many of you would do a few things different and I encourage you to post in the comments so that others can develop their own process. After step 7 I would generally take geographic information and map it using GIS software. This would provide GREAT visual representations of geographical information that executives just eat up. It’s not terribly simple, but a bit of SQL knowledge and confidence in dealing with data goes a long way. I should also note that knowledge of business and HOW the data will be used is absolutely essential.

Problems I have encountered:

  • Very limited storage space. It is unbelievable that I have worked for one of the largest companies in Canada, yet they limited both local and server storage space. Optimizing and zipping data files can be great, but consistency and integrity can be lost.
  • Processing speed. Imagine processing two million record queries on a 4-year-old laptop. Yeah. Not fun.

I hope that I helped a few of you out with this and if you have any questions or are interested in hiring me (I do a LOT more than just this), send me an e-mail!


Leave a Reply