One Million Unique Global English Words. That’s All?

As of December 21st, the website The Global Language Monitor states there are 1,002,116 “Global English Words” in use today. It also notes:

  • English passed the 1,000,000 threshold on June 10, 2009 at 10:22 am GMT
  • English gains a new word every 98 minutes (or about 14.7 new words a day)

According to the book A Million Words and Counting by Paul J.J. Payack:

“In the new world, over 1.35 billion people now speak English. It’s been outsourced to India where 350 million people speak their own variant, Hinglish (fundoo, propone), and to China where 250 million people speak Chinglish (no noising, drinktea). It gets new words virtually every day from every corner of the world and every facet of human existence—from politics (locavore, punditocracy), entertainment (truthiness, brokeback), youth culture (crunk, word!), corporate culture (multitask, scalable), science and technology (quantum, nanotechnology), and the Internet (blogosphere, ROTFLOL).”

Maybe I’m a product of massive amounts of information being generated and consumed by the likes of Facebook, Twitter and other web media, but one million doesn’t seem like that big a number to me. Why is it, then, that massive amounts of storage keep getting consumed by databases and other forms of structured data storage such as logs and text messages? Obviously the reason is that combinations of these words and numbers are used in differing patterns and transactional records, and it is those unique combinations that need to be stored for historical business and compliance purposes. More to the point, many of the most common words are used over and over again rather than being visibly unique, much like the mass of penguins illustrated in the photo above.

Traditionally, RDBMSs such as Oracle have been the mechanism by which discrete data values are separated into cells (rows and columns) for easy indexing, retrieval and updating. What this has meant, however, is that every value needs its own storage space, and consequently the size of these repositories (and their indexes) grows in proportion to the data being loaded into them.

With RainStor (disclosure: my current company), a new specialized repository, unique data values are stored once and only once, and thereafter referred to through pointers to those values and patterns. It’s a method of storage so efficient that, if the data being loaded contained those 1M+ unique Global English values, any combination of representative rows could potentially be added thereafter without proportionate increases in storage capacity. At this steady-state efficiency, you could imagine vast quantities of highly repetitive structured data being stored in a fraction of the footprint a regular RDBMS would require.

Take this simple example where the data below is contrived for the purposes of illustration:

John James buys 200 shares GOOG at 10:00am Dec 21, 2009
John Smith buys 100 shares IBM at 10:00am Dec 22, 2009
Jane Smith buys 100 shares IBM at 11:00am Dec 22, 2009
James Smith buys 200 shares GOOG at 10:00am Dec 22, 2009

Typically this would be stored as 4 records in an RDBMS, with columns representing each data type. With RainStor, each unique value is stored once and only once the first time it appears; the duplicate values (the repeated names, share counts, tickers, times and dates) do not need to be stored a second time.

John James buys 200 shares GOOG at 10:00am Dec 21, 2009
John Smith buys 100 shares IBM at 10:00am Dec 22, 2009
Jane Smith buys 100 shares IBM at 11:00am Dec 22, 2009
James Smith buys 200 shares GOOG at 10:00am Dec 22, 2009

Obviously this is a contrived condensed example, but such patterns and values are evident in large volumes of transactions such as stock trades, call data records, text messages and even Twitter tweets.
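The once-and-only-once idea can be sketched in a few lines of Python. This is an illustration of value-level deduplication (dictionary encoding) using the contrived trade records above, not RainStor’s actual storage format; the whitespace tokenization and the variable names are my own assumptions.

```python
# Each unique value is stored once in a pool; every record then becomes
# a list of integer pointers into that pool.
records = [
    "John James buys 200 shares GOOG at 10:00am Dec 21, 2009",
    "John Smith buys 100 shares IBM at 10:00am Dec 22, 2009",
    "Jane Smith buys 100 shares IBM at 11:00am Dec 22, 2009",
    "James Smith buys 200 shares GOOG at 10:00am Dec 22, 2009",
]

value_pool = []    # each unique value, stored exactly once
value_index = {}   # value -> its position in the pool
encoded_rows = []  # rows as lists of pointers (small integers)

for record in records:
    row = []
    for token in record.split():
        if token not in value_index:        # first occurrence: store it
            value_index[token] = len(value_pool)
            value_pool.append(token)
        row.append(value_index[token])      # repeats cost only a pointer
    encoded_rows.append(row)

total_slots = sum(len(row) for row in encoded_rows)
print(f"{total_slots} value slots, but only {len(value_pool)} unique values stored")

# Any record can be reconstructed from its pointers:
print(" ".join(value_pool[i] for i in encoded_rows[3]))
```

Even in this tiny example, the 44 value slots across the four records collapse to 17 stored values; at the scale of billions of highly repetitive transactions, that gap is what drives the storage savings.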

So visually the previous image of thousands of penguins might be simplified dramatically:

With RainStor, 1M+ unique Global English words isn’t that big a number if it represents the total amount of actual data that needs to be held, while the patterns and variants of records are captured as combinations of those values. The resulting storage savings mean lower operating costs and longer guaranteed immutable, compliant retention of business data.