For as long as people have studied literature there has been a debate about how to define literature. What makes a text a piece of literature as opposed to non-literature? Can one meaningfully describe a book as high literature as opposed to popular literature, or is the boundary between these two categories a fuzzy one? A group of researchers (Max Louwerse, Nick Benesh, & Bin Zhang) have now approached this question from a unique direction, asking whether mathematical algorithms can be used to accurately differentiate literary texts from non-literary ones.
A variety of computational linguistic techniques (e.g., Latent Semantic Analysis and bigram frequency, that is, the frequency of two-word combinations) were applied to a wealth of literary and non-literary texts. The literary texts were drawn from a book by J. Peder Zane (The Top Ten, 2007), which polled 125 top authors from Britain and America on what they considered to be the top 10 works of literature ever published. The non-literary texts included dialogue transcripts, articles published by the New York Times and Wall Street Journal, texts not categorized as literature by the Library of Congress Classification System (e.g., those classified as history, psychology, religion, philosophy, and geography, among others), and popular literature in the form of Star Wars novels.
What Louwerse and his colleagues found was that the various computational linguistic techniques were quite capable of accurately differentiating literary from non-literary texts, even when comparing highly regarded literature to popular literature. You might be wondering what sorts of words or word combinations were most useful for accurately classifying something as literature. It turns out that the two-word combination “and in” demonstrated the most success across the various methods and different texts.
For example, “and in” demonstrated a classification accuracy of 86.3% based on a pool containing literary texts and Star Wars novels. In this particular comparison, high-quality literary texts were very unlikely to be classified as popular literature (only 4.2% of the sample), but the reverse could not be claimed: popular literature was erroneously classified as quality literature a substantial part of the time (40.5% of the sample). What one should take away from this group of studies is the interesting fact that superficial elements of complex narratives, such as word (and bigram) frequency, appear to hold some relation to more abstract and complex constructs such as high-quality literature. That said, what this relation means and how it should be interpreted is certainly difficult to evaluate. Gaining a proper understanding of literature is going to require more than scientific empiricism alone; it will take a diversity of methods and approaches from both the sciences and the humanities.
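To make the idea of a bigram-frequency feature a little more concrete, here is a minimal Python sketch of how one might count how often “and in” appears in a text. It is an illustration only, not the procedure Louwerse and his colleagues used: the function name `bigram_rate`, the simple regular-expression tokenizer, and the choice to report a rate per 1,000 adjacent word pairs are all assumptions made for this example.

```python
import re
from collections import Counter


def bigram_rate(text, bigram=("and", "in")):
    """Return occurrences of `bigram` per 1,000 adjacent word pairs.

    Hypothetical helper for illustration; the tokenizer and the
    per-1,000 scaling are assumptions, not details from the study.
    """
    # Lowercase the text and keep runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    # Pair each word with the word that follows it.
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    return 1000.0 * Counter(pairs)[bigram] / len(pairs)


sample = "He sat down, and in the morning he left and in silence."
print(round(bigram_rate(sample), 1))
```

A number like this, computed for every text in a collection, is the sort of superficial feature that a classifier could then try to use to separate one group of texts from another.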
Louwerse, M. M., Benesh, N., & Zhang, B. (2008). Computationally Discriminating Literary from Non-Literary Texts. In S. Zyngier, M. Bortolussi, A. Chesnokova, & J. Auracher (Eds.), Directions in empirical literary studies (pp. 175-192). Amsterdam, Netherlands: Benjamins.