Thinking about Size, Scope, and Samples–And Music

My project is going well, but it has not been without a few pitfalls. As usual, actually doing the project has made me rethink a number of my design decisions. The issues all have to do with size, scope, or whether representative samples are important.

1. Size: I have decided to venture into the world of topic modeling. I think it, more so than any other mode of analysis, is essential to getting what I want out of my analysis. The problem, though, is that I have read that topic modeling works best on an extremely large dataset. One manual-style article suggested that one would need at least a thousand documents for topic modeling to really be effective. Needless to say, I do not have a thousand documents. Currently, I’ve cataloged only three hundred blues lyrics and three hundred old time lyrics, and because I am going to be comparing the two, each genre has to be modeled as its own corpus, so I cannot count them as a combined six hundred documents. I could increase my collection of lyrics, but doing so would take quite a bit of time and cause me to alter the project’s scope.

2. Scope: I initially decided that I would restrict my search for lyrics to artists born prior to 1900, which I took to be the founding generation of commercially successful Southern musical artists. The thought behind this decision was that if I could categorize artists by generation, I could then potentially track change over time if I decided to expand my project forward. I now realize, however, that this logic has quite a few flaws. For one, what about the “tweeners?” By that I mean those artists born right on the cusp of 1900. For instance, are Blind Willie McTell, born in 1898, and Son House, born in 1902, really part of two different generations just because their births fell on opposite sides of the turn of the century? In hindsight, I say no. Another problem is that I naively believed that those born after 1900 might not be a part of the same artistic community as those born prior to 1900. Of course, I now realize that age has little to do with whether an artist will become popular and when. The popular artists of the 1920s were men and women of varying ages.

3. Sample: Another decision that I made, perhaps in too much haste, was to try to make my blues samples congruent with my old time samples. What I mean is that I wanted to have an equal number of blues songs and old time songs from an equal number of artists. This endeavor, though, is next to impossible. Blues lyrics have been much more accessible than old time country lyrics. And naturally, some artists were much more popular than others, creating an imbalance between what is accessible and what is not.
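One possible way to handle this imbalance, offered only as a rough sketch, is to catalog whatever is accessible and even things out afterward: cap how many songs any one artist contributes, then downsample each genre to the size of the smallest one. The record layout (`genre`, `artist`, `title`), the function names, and the placeholder catalog below are all hypothetical and are not drawn from my actual data.

```python
import random
from collections import defaultdict


def cap_per_artist(songs, cap, seed=0):
    """Keep at most `cap` randomly chosen songs for each artist."""
    rng = random.Random(seed)
    by_artist = defaultdict(list)
    for song in songs:
        by_artist[song["artist"]].append(song)
    kept = []
    for artist_songs in by_artist.values():
        if len(artist_songs) > cap:
            artist_songs = rng.sample(artist_songs, cap)
        kept.extend(artist_songs)
    return kept


def balance_genres(songs, seed=0):
    """Downsample every genre to the size of the smallest genre."""
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for song in songs:
        by_genre[song["genre"]].append(song)
    target = min(len(group) for group in by_genre.values())
    return {genre: rng.sample(group, target) for genre, group in by_genre.items()}


# Hypothetical catalog: placeholder artists and titles, not real data.
catalog = (
    [{"genre": "blues", "artist": "Blues Artist A", "title": f"song {i}"} for i in range(6)]
    + [{"genre": "blues", "artist": "Blues Artist B", "title": f"song {i}"} for i in range(2)]
    + [{"genre": "old time", "artist": "Old Time Artist A", "title": f"song {i}"} for i in range(3)]
)

capped = cap_per_artist(catalog, cap=3)   # no artist contributes more than 3 songs
balanced = balance_genres(capped)         # both genres end up the same size
for genre, group in balanced.items():
    print(genre, len(group))
```

The obvious trade-off is that downsampling throws lyrics away, which works against the size problem described in point 1, so this would only make sense if the smaller genre were already large enough on its own.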

I am somewhat torn about what to do if I want to expand the project beyond what the class requires. I think that the project’s comparative nature is what is most important, but, at the same time, topic modeling both genres and doing it correctly would require a much larger set of data. Therefore, if I were to expand it, I would be very tempted to analyze just one of the two genres. Doing so, though, would eliminate the project’s comparative element.

Image URL: http://ccriderblues.com/wp-content/uploads/CC-Rider-Banner-2014-DRAFT-Mar5.jpg