I am beginning a new data collection project that requires the manual coding of data collected from various sources in print and online. As I start this project, I am faced with deciding how to build a master record of all the data I collect in the process. I have worked on projects that used extensive paper coding forms that were later filed away, only to be retrieved when appropriate. This serves as a safeguard: the forms allow checking of original coding decisions, catching errors in the database, and preserving any other information the coders found while researching the topic at hand. Alternatively, other projects had an evolving Excel spreadsheet that itself was the master record – duplicate copies served as a safeguard against accidents, while the final form only came into existence once the researchers stopped coding. Finally, projects that are entirely automated tend to generate their own database that the researchers can then use to extract useful information into a final dataset for examination.
For this project, I decided to make a separate master database that will then be used to generate data sets as needed. Any changes will be documented in the master set, and the generated data sets will strip out coding variables (last updated, side notes, etc.) that are otherwise irrelevant to people using the data. Thus, the final versions will be tab-delimited for general consumption, with Stata and R versions produced as well, while the master remains in a different format for official changes (a sketch of this export step appears below).
More after the jump; for those interested, there are screenshots, and I attempt to elicit ideas for more efficient mechanisms to collect and store data…
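To make that export step concrete, here is a minimal Python sketch of how the public versions could be generated, assuming pandas is installed; the file names and coding-variable column names are placeholders for illustration, not the project's actual codebook:

```python
import pandas as pd

# Load an export of the master database; "master.csv" and the column
# names below are illustrative placeholders.
master = pd.read_csv("master.csv")

# Strip the coding variables that are irrelevant to end users.
coding_vars = ["last_updated", "side_notes"]
public = master.drop(columns=coding_vars)

# Tab-delimited version for general consumption.
public.to_csv("dataset.tab", sep="\t", index=False)

# Stata version; R users can read the tab-delimited file directly
# with read.delim("dataset.tab").
public.to_stata("dataset.dta", write_index=False)
```

The master file itself is never touched by this step, which keeps the original record intact.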
My first idea was to make an online form accessible via my personal website (with a hidden URL and other protections to prevent random spambots and malicious users from attacking it and adding information), but this idea slowly died with two realizations. First, I do not know MySQL or any other appropriate website back end that could handle the form data and put each entry into the appropriate cells. Second, I really do not need an online form if all my data collection is done from a central computer (a laptop that I can take with me to research). Thus, learning MySQL would be an unneeded task on top of the other "languages" I am learning this summer. There are a few programs that can handle the back end automatically, but they seem to require a monetary investment – more than I want to contribute to the project.
The second solution is the process I am attempting right now. Lacking the Ultimate version of Office 2007 (the one that includes Access), I installed OpenOffice for the first time. Base is an Access clone that will hopefully fulfill my needs as I slowly learn the quirks and demands of the software. So, the first task is to build the necessary fields, with information on what each field contains. Using the normal field design view and entering the established fields sequentially, I have (click any picture below for a full image view):
Initially I had more variance in the field types, but I had problems with the date field (it kept returning today's date) and with binary fields failing to show up in the forms I later created. So, defining the fields was nothing I could not have done in Excel, but the payoff is that I can now create a form for manual data entry into the spreadsheet:
It is a very appealing and intuitive form to fill out and later recall for single-record manipulation. It allows review and will hopefully decrease the mistakes that come from linear entry into a spreadsheet.
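For readers who prefer code to screenshots, the underlying table amounts to something like the following, sketched here with SQL through Python's built-in sqlite3 module. The field names are my own illustration, not the project's actual codebook; storing dates as ISO-8601 text and binary codes as 0/1 integers sidesteps the type quirks I ran into in Base:

```python
import sqlite3

conn = sqlite3.connect("master.db")

# Illustrative fields only; the real codebook differs. Dates are kept
# as 'YYYY-MM-DD' TEXT and binary codes as 0/1 INTEGER values, which
# avoids the date-default and boolean-display problems described above.
conn.execute("""
    CREATE TABLE IF NOT EXISTS records (
        record_id    INTEGER PRIMARY KEY,
        source       TEXT,
        event_date   TEXT,     -- 'YYYY-MM-DD'
        coded_flag   INTEGER,  -- 0 or 1
        side_notes   TEXT,
        last_updated TEXT
    )
""")
conn.commit()
conn.close()
```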
The result is a nice master database that is slowly building (click for full table):
This is my current start. Now to elicit some feedback. Primarily, what methods have you used in the past (or currently use) to permanently store collected data? I obviously believe in having a separate master database that remains original and free of any subsequent manipulations by researchers, for both verification and replication. Are there superior programs for what I am doing? Any and all other thoughts are welcome.
I published an article in “The Political Methodologist” where I describe how to do this with a Web form, MySQL, etc. If you enter all of the data yourself, the solution doesn’t matter too much; I use Excel for smaller ad hoc databases. However, if you want to have someone else enter data, even just one person, a Web solution is much better (security, control, privacy, backups, etc.) If you want more than one person to enter data, a Web solution pulls far ahead. MS Access and clones are not designed for simultaneous users; even if they claim to be able to support simultaneous users, they don’t do it well: the interface slows to a crawl, and data corruption becomes a real danger.
It is not hard to get going using a Web form, even on your own laptop. You can get solid, free software that makes it easy. XAMPP (http://www.apachefriends.org/en/xampp.html) is an incredibly simple way to get Apache, PHP, and MySQL running on anything; it took me about 10 minutes to get it running on Windows. Once you have that, you can use the template I provide (http://haptonstahl.org/srh/papers/2006/webdata/) and start collecting data quickly.
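To give a sense of how little server-side code the basic idea requires, here is a comparable minimal sketch that uses only Python's standard library in place of the PHP/MySQL stack. It is purely illustrative (the field names are made up); the actual template at the URL above is the PHP/MySQL version:

```python
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

# A single-machine entry form: GET serves the form, POST appends the
# submission to a SQLite table. Field names are illustrative only.
FORM = b"""<form method="post">
Source: <input name="source">
Notes: <input name="notes">
<input type="submit" value="Save">
</form>"""

class EntryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(FORM)

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        fields = parse_qs(self.rfile.read(length).decode())
        with sqlite3.connect("entries.db") as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS entries (source TEXT, notes TEXT)")
            conn.execute("INSERT INTO entries VALUES (?, ?)",
                         (fields.get("source", [""])[0],
                          fields.get("notes", [""])[0]))
        self.send_response(303)  # redirect back to a blank form
        self.send_header("Location", "/")
        self.end_headers()

HTTPServer(("localhost", 8000), EntryHandler).serve_forever()
```

Anything beyond a toy like this (multiple coders, authentication, backups) is exactly where the XAMPP setup and the template earn their keep.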
Start with the article: http://polmeth.wustl.edu/tpm/tpm_v15_n2.pdf It’s worth a look.