Some Data Gathering Resources

Hi everyone – today I wanted to put together a fairly quick post about some of the resources I’ve come across in the past year that I’ve found interesting (and occasionally useful) in putting together some of the posts on this blog.  I’ve also found a lot of resources that I haven’t fully utilized (yet), but figured it might be useful to share.  Anyway, here you go:

Reddit Insight – “We downloaded the Reddit”

http://www.redditinsight.com/

Who doesn’t love Reddit when you’re looking for something to kill a few minutes/hours/days?  If you’re bored of Reddit, though, you can use this site to kill time while looking at data generated by and about Reddit.  Meta time killing, if you will.  There are some cool tools, and who doesn’t like word clouds?

Aww subreddit word cloud

Wikipedia Statistics (Overall) – “WP:ST” redirects here.

http://en.wikipedia.org/wiki/Wikipedia:Statistics

There is a lot of information on Wikipedia, and there is also a lot of information about that information on this page.  I don’t know where to begin – this post could just be about this page.  How about the top 25 Wikipedia pages from last week?

Wikipedia Special Pages – “This page contains a list of special pages.”

http://en.wikipedia.org/wiki/Special:SpecialPages

So you think that the last page had a lot of information about Wikipedia, and there’s probably not much more that’s really interesting enough to talk about?  Well, welcome to the sub-basement of Wikipedia, where people get together to generate lists of all sorts of things, like Long Pages, Orphaned Pages, and perhaps my favorite list on the internet, that of Uncategorized Categories.

Wikipedia Random Page – “Do you feel lucky, punk?”

http://en.wikipedia.org/wiki/Special:Random

This doesn’t generate any statistics on its own, but I think it’s an interesting page nonetheless.  You can certainly run some calculations based on the numbers from other Wikipedia pages, like what your odds are of finding the page you’re looking for on Wikipedia by simply clicking that link (it’s about 1 in 30 million).

So if you’re feeling like learning something, give some Wiki Roulette a try.  

Who knows, someday it might be important that you know something about David Alton, Baron Alton of Liverpool.

Wikipedia Pageview Stats – “How about you tell me in graph form?”

http://stats.grok.se/en/201306/wikipedia

I swear that a few years ago Wikipedia had some tools built into their own site to look at stats from specific pages, but recently this page has been all that I could find.  It’s great if you’re looking for patterns in data, like if people are more likely to look at articles about particular days of the week on those days of the week.

 

Google Trends – “Two trends enter, one trend leaves”

http://www.google.com/trends/topcharts

Google Trends seems to have two main things going on.  The first is the stuff on the main page, which is letting you know what’s trending on Google.  Personally, that’s really quite boring.  The fun part of Google Trends is pitting two (or more?) topics against each other to see how search volume has compared over some space of time.  For instance:

http://www.google.com/trends/explore?q=red%2C+blue#q=red%2C%20blue&cmpt=q

Produces a great graph of search volumes of “red” vs “blue”:

Yes, I know (and you should too – that’s what legends are for!) that red is the blue line and blue is the red line.  It’s the order they are put in that matters, and I find the prospect of them switched to be amusing for some reason.  So, I’m keeping it.

That said, looks like red started winning sometime around 2008.  Some of those letters on the graph might help you pull out why that is, as it links time frames to news stories and the like.  It also shows searches that contained these words, etc.  It’s a fun tool, especially for those of us who always wondered why people thought it was so hard to compare apples to oranges.

Google Correlate – “Correlating your Googles”

http://www.google.com/trends/correlate

This one is a bit newer to me, but has some cool applications.  It lets you see what words are searched for together, or rather which search terms are correlated to any given search term.

For instance, a search for “turkey” reveals that people are very often searching for “turkey stuffing”.

You can also export these results, and do some other stuff with it, I guess?  Like I said, this one is relatively new to me, so I’m in the process of thinking of ways to try to use it.

Professional Football (NFL) Stats (1940 to present) – “All the games, and then some”

http://www.pro-football-reference.com/

If you’ve been reading the blog for a while you recognize this site, as it’s the one that I used to look at historic scores in the first week of the season.  It has all the outcomes from every NFL game ever played, and some stats from games even before that.  There’s…a lot of information.

Professional Baseball (MLB) Stats (1916 to present) – “Like football, but baseball”

http://www.baseball-reference.com/

While I’ve never used it, you can also find stats on all professional baseball games played in the last century or so.  Again, it’s a lot of information.

Professional Hockey (NHL) Stats (1987? to present) – “We taped over all the early games”

http://www.hockey-reference.com/

Finally, there’s also a professional hockey version of the last two pages, though for some reason it only extends back to 1987.  Maybe they’re working on it?  Hard to say.

Twitter Stats – “A constant, never ending stream of information”

I actually had a decent amount of trouble finding any official stats on the Twitter page, as it looks like almost all of the stats generating tools are third-party.  Maybe that’s not the case, and I’m looking in the wrong place.  If I was going to list one third-party Twitter tool I’d rather list a ton, so maybe that will just wait until a future post.

More on the importance of exponential growth OR that part in Wayne’s World

Yes, I’m talking about this part of Wayne’s World:

It seems that over the past few posts I’ve touched on exponential growth from a few different directions.  One of those ways was relating to the proliferation of unique Tetris pieces you can make with a set number of 1×1 Tetris blocks, and the other two were touching on the Wayne’s World social network method from above.

Those two posts were the one about my Kickstarter project and the one about saving the post office by creating a national culture of everyone removing and returning business reply mail envelopes from junk mail.

Let’s get the obligatory Kickstarter plug out of the way.  The reason that Kickstarter works (when it works) is due to the nature of social networks.  If I just wanted to collect money from the people I knew, I’d simply ask every person I knew if I could borrow a dollar the next time I saw them.

It’s an interesting social experiment, and perhaps one I’ll try sometime, but it’s not the point.  The point isn’t for people I know to give me money (though thanks if you have!), but for them to tell the people they know.
You see, I might be able to say that I ‘know’ a few hundred people.  People who if I saw them sitting at a bar in an airport while traveling I would sit down and strike up a conversation with (my favorite test of if you actually ‘know’ a person).  I don’t know that I’m too much of an outlier on that – keep in mind I’ve said ‘know’ and not ‘friends’, which is a whole different story.

If each of those people gave me a dollar (from the above example), I’d have a few hundred dollars.  But if those people instead just told their friends about me, and I got a dollar each from them, well, that’s a lot more dollars.

How many more dollars?

Well, let’s just make this easy.  Let’s say I have easy communication access (can I put that in any colder terms?) with 100 people.  Let’s also say that they also have 100 people with whom they share the same access, but those people are 100 different people (I guess I’m in there, too, so maybe they need to have 101 people).

Regardless.

If I did somehow find myself in a situation where I knew 100 people who each knew 100 non-overlapping people, that second set of known people is exponentially larger than the first.  Why?  Because there are exponents involved.

Joking aside, exponential growth occurs when the growth of a mathematical function is proportional to the function’s current value.  In the ideal case that’s y = n^x, where n is some number.  (Yes, I also know that the exponential function – not just exponential growth – involves the use of e^x, but that’s outside this discussion.)

What I’m getting at is that the number of people in this secondary network is 100*100, or 100 squared (100^2).  100^2 is 10,000.

That’s not the function, that’s simply one step along the way.  There is a function in how the number of people in the primary network (the people I know) is related to the number of people in the secondary network (the people the people I know know).  That function is a simple square: y = x^2.  This still isn’t where exponential growth comes into play, but it’s worth discussing first.  A square function (in fact, this square function) looks like this:

If the number of people that people know is 100, then we get the 10,000 above.  If each person knows 10,000 people, then the secondary network is out at 100 million.  If each person knows 3 people, then the secondary network is only 9 people.

If you happen to have friends who also like telling people things (and also happen to miraculously have a completely unique set of friends aside from you), we would move out to a cubic function: y = x^3.  Now, the number of people that can be reached if everyone has 100 people to talk to is 100*100*100 = 1,000,000

That’s right, one MILLION people.

By increasing from a secondary to a tertiary network, we’re incrementing the value of the power in the function.  It is through this that exponential growth occurs.  The Wayne’s World growth is every person telling two friends, so the function y = 2^x shows how many people you are contacting at that stage of the process (e.g. x = 3 is three steps removed from the initial person).  That looks like this:

Don’t be fooled by the scale into thinking that those numbers below 75 on the x-axis are zero.  They’re just really small compared to the end number, but they’re still really big.  For instance, 2^25 is still 33 and a half million.  The fact that 33,500,000 looks like zero might give you some perspective on just how big those numbers toward the right end of the graph actually are.

The graph of what we were talking about above, with every person telling a hundred friends is somewhat similar, except all the numbers have a lot more zeros on them.  In fact, since we’re working with a nice power of 10 system all the numbers can simply be expressed very easily in scientific notation.  So much so that a table is perhaps more illustrative than a graph.

So, if I told 100 friends about something, and then each of them told 100 (different) friends, and so on, and so on, we’d run out of humans on the planet sometime between the 4th and 5th step.  The trick would really be finding those 100 unique people each time.
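
Since that table is really just powers of 100, here’s a minimal Python sketch that regenerates it and finds the step where we run out of humans.  The 7.1 billion world population is a rough figure I’m assuming for illustration, and the names (FANOUT, WORLD_POPULATION, reach_at_step) are made up for this example.

```python
# Rough sketch: how many people hear the news at each step if everyone
# tells 100 brand-new people. WORLD_POPULATION is an assumed round figure.
FANOUT = 100
WORLD_POPULATION = 7_100_000_000  # assumption: roughly 7.1 billion people


def reach_at_step(step, fanout=FANOUT):
    """People contacted at a given step: fanout ** step."""
    return fanout ** step


def first_step_exceeding(limit, fanout=FANOUT):
    """Smallest step whose reach is larger than `limit`."""
    step = 0
    while reach_at_step(step, fanout) <= limit:
        step += 1
    return step


for step in range(1, 6):
    # Scientific notation keeps the columns readable.
    print(f"step {step}: {reach_at_step(step):.0e} people")

print("first step past the world population:",
      first_step_exceeding(WORLD_POPULATION))
```

With those assumptions the reach is still short of the world population at step 4 and past it at step 5.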

You might also note that this is how pyramid schemes work, and why they are always (eventually) unsustainable.  To keep the scheme going you need to keep finding unique people to enter into it.  The longer it goes on the more and more unique people you have to find.

For instance, here’s the table for the Wayne’s World 2^x method:

So, even if you’re just having each person in the system tell two other people, you still run out of people on the planet in about 33 steps through that system.  Of course, an actual pyramid scheme is a bit more complex than this, but this would illustrate one running at peak efficiency.
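
If you’d rather check that step count than take my word for it, a small sketch like this one does the job.  The world population figure is again my own rough assumption, and steps_to_cover is a name I made up for the example.

```python
# Sketch: count the steps needed before fanout ** steps covers a population.
WORLD_POPULATION = 7_100_000_000  # assumption: roughly 7.1 billion people


def steps_to_cover(population, fanout=2):
    """Smallest number of steps where fanout ** steps >= population."""
    steps = 0
    while fanout ** steps < population:
        steps += 1
    return steps


print(2 ** 25)                           # the "looks like zero" value on the graph
print(steps_to_cover(WORLD_POPULATION))  # steps for everyone telling two friends
```

Swapping in a different population or a different fanout is just a matter of changing the arguments.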

Let’s step back from pyramid schemes for a moment.

Think of it this way.  If two of you each told two friends about the post office plan from last week, and those two people told two people, etc, we’d have the whole US told (again at peak efficiency) at just under 30 or so steps.

Back to pyramid schemes though (I’m kidding), we don’t need the whole of the country on the kickstarter – this process at 10 steps still has over a thousand page views.

So, you know, do both those things.

Tetris pieces are growing in perhaps a much more interesting way, that I’m only going to touch on briefly (until I decide to do a post that looks at those 6 and 7 block cases).  I talked about it during that post, but every time you add a block to the system you can place that block on a number of spots on preexisting pieces.  Early in that process you a) have fewer pieces and b) those pieces are smaller.  The growth that occurs at that stage is slow, then.

As you start to get more pieces, and those pieces get bigger, there are both more spots on any given piece to put a new block as well as more pieces on which to do so, which drives this accelerating growth.  Some of these aren’t unique, but it’s possible that the proportion of non-unique pieces produced at each step has a predictable function as well.

Something to look at later.

Or perhaps something to dream about

How to save the post office (and stick it to the man)

Here’s a question for you.  Why won’t UPS or FedEx come to your door every day, pick up any mail you might have, and deliver it to any other address in the country – hold on, I’m not done – for the price of 50 cents or so per item?

The short answer is that it’s simply not cost effective, at least without a huge customer base and established infrastructure.  
Even then, it’s still a challenge.  The USPS knows that well.
The USPS was founded on the principle that mail service is a right of everyone in the country, and that prices should never become so burdensome as to exclude anyone from that process.  Even if you live on top of a mountain, or in the Grand Canyon, the USPS will still bring you mail for just the price of postage.
The problem is that this system works best when it’s running near or at capacity.  There’s a lot of infrastructure in place, and as we send less mail in aggregate the costs of running the system don’t really decrease proportionally.  
To pump money back into the post office, then, all we need to do is send more mail.
Well, duh, you’re saying, but that costs money.  And you have to sit down and write stuff.  It’s soooo 20th century.  
Sure, it is.  While I’ll stand by the fact that people should write more letters, I’ll agree that this isn’t the solution.  At least not the solution we’re looking for *waves hand*.
You might not want to throw money at this, but there are plenty of those that do.  In fact, there’s a good chance that they’ve mailed you some money today.  What are you likely to do with it?
Tear it up and throw it into the garbage.  
You see, I’m talking about business reply mail.  Yep, this stuff:
You see, companies pay to send you envelopes full of solicitation, including these business reply envelopes.  The trick is that they pay in bulk (and get a discount), and also only pay for return postage (also discounted) when those envelopes are returned.  
It’s a pretty safe bet on the return mailers – they’re happy to pay for them because when they do come back they’re filled with what could basically be gold: filled out credit card applications, uh, other filled out credit card applications, uh, etc.  
Some of you are a few steps ahead of me already, I can tell.  What I’m about to suggest isn’t a new idea – I’m confident that a good chunk of the country has independently derived this on their own.  It’s not tricky, and you can find plenty of people already suggesting it on the interwebs (which makes our eventual job easier).  
By simply mailing these envelopes back – empty – you are in effect taking some amount of money from these corporations and surreptitiously donating it to the USPS.  
Before you say “I already saw that a bunch of places”, let me again point out that a quick search reveals tons of people who have also come up with this idea to varying degrees.  What I’m looking to figure out today is how much this donation to the USPS would actually be if we all started doing it.  
So, I’ve been counting my mail.  
It’s probably not shocking to anyone who a) is alive and b) lives in the US that I (we) get a lot of this type of mail.  There are some days I don’t get any, and some days where I get a bunch.  On average it seems to work out to about one business reply envelope a day (days that I get mail, so discounting Sunday).  
Let’s err a little lower to keep things nice and round, and say that I get 300 business reply envelopes a year.  
It’s kind of hard to figure out exactly how much it costs to cover the cost of a returned business reply mail envelope, as the USPS has some information here: 
But seems to hold back on pricing information until you try to do it.  I’ve had trouble finding anything else on their site about what sort of discount actually takes place, so lacking that information we can simply work on the known bounds.  
That is, we know the most and least that one of these business reply mail envelopes would cost to cover.  The least is nothing, if the post office is simply in collusion with the companies and not really worried about losing money.  That seems pretty unlikely.  
The most that they could charge is something less than the price of a stamp, or it wouldn’t be a discount.  The current cost of a stamp is $0.46.  That means that the upper bound of what I could be ‘donating’ to the post office in a year is somewhere around $138.  If companies are getting a 50% discount on mailings, we’re now talking $69.  If they’re getting a 75% discount it’s only $34.50.
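
Here’s that per-person arithmetic as a quick sketch, in case you want to plug in your own envelope count.  The discount levels are the same guesses as above (I still don’t know the real rate), and the names are just for illustration.  It works in whole cents to avoid floating-point rounding.

```python
# Sketch of the per-person bounds, in cents to avoid float rounding.
STAMP_CENTS = 46          # price of a stamp quoted above
ENVELOPES_PER_YEAR = 300  # rounded-down count of business reply envelopes


def yearly_donation_cents(discount_percent):
    """Cents per year reaching the USPS if companies get this discount."""
    full_price = STAMP_CENTS * ENVELOPES_PER_YEAR
    return full_price * (100 - discount_percent) // 100


for discount in (0, 50, 75):  # guessed discount levels, not USPS figures
    cents = yearly_donation_cents(discount)
    print(f"{discount}% discount: ${cents / 100:.2f} per year")
```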
$34.50 might not seem like much, and in the grand scheme of fixing the post office it’s really not.  The way around this is the law of large numbers.  You can still see from the same Google search I told you to do earlier that this is not a highly unique idea.  This is a very easy idea to develop independently and simultaneously.
In the latest US census, 76.5% of the 313,914,040 people in the country were over the age of 18.  I know for a fact that you can get this type of mail well before you’re 18, but for simplicity’s sake let’s just go with those who are right in the target market for this kind of mail.
That leaves us 240,144,241 people who are likely to be receiving some number of business reply mail envelopes in the mail.  But how many?  
Well, I don’t think the numbers that I’ve found for myself should be anything outlandish.  I go to lengths to make sure that companies *don’t* have my address, so if anything I should be on the lower end of the scale.  
Let’s simply say, though, that I’m presumably somewhere around average (if you don’t believe me, start counting your mail).  What would that mean?
Well, it would mean that instead of me chipping in something like $34.50 a year, 240 million people could be.  
If you have a basic grasp on math you can see that we have a two digit number that’s going to be multiplied by a number that has run out of millions digits.  That means we’re now talking billions.  
$8,284,976,314, to be exact.  
Everyone online seems to have a different number for the USPS budget shortfall each year, but they mostly seem to fall between 5 and 11 billion, which means that 8 billion dollars could actually make a dent.  
That’s also operating on the presumption that these companies get discounts as high as 75% on business reply mail returns.  It could be higher than this, I’ll admit, but it could also be lower.  If they were only getting a 50% discount we’d be looking at $16,569,952,629.  With no discounts whatsoever we’d be looking at a cool $33,139,905,258.
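
For the national totals, here’s a rough sketch that multiplies the per-person amounts out across the adult population.  It assumes (as I do above) that every adult gets the same 300 envelopes a year, and the discount levels are still guesses.  Totals can differ from hand-rounded figures by a dollar or two depending on where you round.

```python
# Sketch: scale the per-person yearly amount up to every adult in the US.
US_POPULATION = 313_914_040
ADULT_SHARE_PER_MILLE = 765  # 76.5% over 18, kept as an integer

# Round to the nearest whole person using integer math.
ADULTS = (US_POPULATION * ADULT_SHARE_PER_MILLE + 500) // 1000

# Per-person yearly amounts in cents for each guessed discount level.
PER_PERSON_CENTS = {"75%": 3450, "50%": 6900, "no": 13800}


def national_total(cents_per_person, people=ADULTS):
    """Return the yearly total as (whole dollars, leftover cents)."""
    return divmod(cents_per_person * people, 100)


print(f"adults: {ADULTS:,}")
for label, cents in PER_PERSON_CENTS.items():
    dollars, leftover = national_total(cents)
    print(f"{label} discount: ${dollars:,}.{leftover:02d}")
```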
It’s easy to read that and say, ‘yeah, but that’s if everyone does it, it doesn’t matter if I do’.  
Well, it does, because you’re part of everyone.  Honestly, make this a habit.  Instead of just tearing up and throwing away your business reply mail return envelopes (you should be recycling them anyway, jerk!), make a pile of them and then recycle the rest of the paperwork.  
Have fun with it, save them up and send them all out on the first of the month or something.  Pick a day of the month when you pay your credit cards and take a bit more delight in the fact that you got something back out of it, too.  Well, at least the post office would.  
Some people will tell you to get all spiteful about it, and mail other junk mail, or crackers, or ez-cheese, or bricks, or other things that are just not a great idea to be sending through the mail.  Don’t make this about anger, make it about release.  You’re getting rid of something you don’t want, and helping out an organization who needs it.

Some other people will tell you to do this so that the credit card companies will stop sending you business reply mail envelopes.  Sorry to burst your bubble, but they’re never going to stop.  If every single person in the country was doing this every day they *might* start to notice.  $8 billion spread across all the companies that send you this type of mail is still the equivalent of a mosquito feasting on the ankle of a giant.  
But seriously, this isn’t hard.  Do it. 

On Kickstarter

Hi everyone – today we’re going to talk about Kickstarter.  We’re going to talk about Kickstarter because I’m currently Kickstarting a Kickstarter project related to the blog.  Kickstarter.

Anyway, for those of you who like to just get to the point right away, here’s a link to the project.
or short form for easy linking: http://kck.st/1514HKO
Take a moment to go check it out, and if you’ve liked some of the stuff on here do consider tossing a few bucks at it.  If you’ve never used Kickstarter (and don’t have an account) take note that it’s actually quite easy to get started with.  Check out my project, check out some other projects – there’s always something good on there.  
Also, pass along the word to anyone who will listen.  I’ve set the initial funding bar low, but have some fairly crazy stretch goals in there.  The more people who fund the project, the more stuff everyone gets out of it.  I doubt I’ll hit them, but who knows.
So, go check it out, and spread the word!
If you want to talk a bit more about Kickstarter, though, that’s what the rest of today’s post is about.
Kickstarter was set up a few years ago (2009) as a means of crowd-funding projects (not to be confused with cloud-funding projects).  People put up projects, and other people can support those projects to help get them off the ground.  If the project is funded the person who created the project gets the money pledged, and if the project is not funded no one is charged anything.  
Kickstarter keeps a page on their website about stats, and they seem to update it daily.  It’s actually a pretty cool resource, and you can find it here:
You can see that Kickstarter recently passed the 100,000 projects launched mark, and they also just passed the point where they have successfully raised 600 million dollars.  Yes,
$600,000,000.00
So, they’re doing alright.
There are also some cool numbers about successful and unsuccessful projects.  There have been around 45,000 successful projects and around 57,000 unsuccessful projects.  
The successful projects that have raised a ton of money ( http://www.kickstarter.com/discover/most-funded ) are kind of few and far between.  The majority of projects are just small things that raise less than $10,000 – at the moment right around 77% of successful projects raised less than 10K.  
The unsuccessful projects, though, well, a lot of them just seem to sit and languish without anyone ever pledging anything.  Nearly 1 in 5 of unsuccessful projects never get a single pledge.  
That could mean that those were just poorly constructed projects, etc, or that there’s some sort of momentum that builds as a project gets going.  No one wants to be the first person to back a project, right?  If no one else is doing it, it must not be good?  It would be pretty interesting to actually find a way to measure some of that, but for now all we can really do is speculate.

I’ve never been a first pledge on a project, so I’m not really sure what it feels like.  Whoever pledges first on this one, what was your thinking?  Do you wish there were comments so you could yell FIRST?

Kickstarter does point out that once a project gets past 20% funding it is more likely to succeed than to fail.  There are a lot of projects (over 46,000 as of today) that fail without ever reaching that 20% threshold.   
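
To put those round numbers in proportion, here’s a tiny sketch.  The counts are the approximate figures quoted above from the stats page (they change daily), and share is a helper name I made up.

```python
# Approximate figures quoted above from the Kickstarter stats page.
SUCCESSFUL = 45_000
UNSUCCESSFUL = 57_000
FAILED_UNDER_20_PERCENT = 46_000


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Overall success rate among finished projects.
print(share(SUCCESSFUL, SUCCESSFUL + UNSUCCESSFUL), "% of projects succeed")

# Share of failed projects that never reached 20% funding.
print(share(FAILED_UNDER_20_PERCENT, UNSUCCESSFUL), "% of failures stall below 20%")
```

With those inputs, a bit under half of projects succeed, and roughly four in five of the failures never make it to the 20% mark.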
So, that’s what I’m hoping for today.  20% of this project is only $50, so we’re talking baby steps.  One person could do that on their own, though I’m not trying to say that one person alone needs to.  If we can get to 20% today then we can just let momentum carry things for a while.  
Though, it would also just be great to fully fund it, too.  That would make things nice and easy.  
So why are you still here?  Go check out the project!

Tetris Pieces, Exponential Growth, and Unexpected Cryptography

I talked about Tetris a few weeks ago (here), and examined the counts of pieces across a number of games. That got me thinking about Tetris pieces, especially when I started to look at some of the Tetris variants that people have programmed.

One of the things that lingered in my mind was the idea that all Tetris pieces are made up of four 1×1 blocks.  I’ve played some variants over the years (many of them official) that did some other things with these four blocks, like allow for corner-to-corner connection (unacceptable!) instead of strict face-to-face connection, but none to my recollection that changed the standard use of four 1×1 blocks per piece.

So what would happen if we changed around this simple building block (haha) of the Tetris franchise?

The simplest Tetris game would be a situation where each piece was made from one 1×1 block.  When I say simplest, I’m not sure I can express that enough.  You just get the same piece over and over, and try to build lines with it.  It’s basically two-dimensional Minecraft where the pieces fall from the top and sometimes disappear.

The two 1×1 block case is hardly different, actually.  There’s still just one piece, though at least now rotation matters in game.  Not that much, though.

The three 1×1 block situation starts to get a little more interesting.

You still only end up with two distinct pieces – a hooked piece and a straight piece.  It might actually make for a fairly interesting (if not somewhat simpler) game.  Tetris Jr., perhaps.  Someone code this.

A situation with four 1×1 blocks is the Tetris we know and love, and produces seven pieces.

You should be familiar with these, but you can also see that they’re basically just an extension of the pieces in the three 1×1 situation.  You can add a 1×1 block to the first piece (in the three 1×1 case) to make the square (O) block.  You can also add a 1×1 block to that piece to make the S, Z, J and L pieces.  You can make the T out of any of the pieces, but the only way to make the straight (I) piece is to add on to the straight piece from the prior set.  It’s a rather simple but interesting point.

So we already have a pretty interesting set of numbers here.  The first two cases result in a single piece, then we move to two, then to seven.  We’re more than doubling with the addition of one more 1×1 block because each piece already able to be made can produce a number of new pieces.  The more complex those pieces, the more places a block is likely to produce a new unique piece when placed.

Let’s see if we can build a set of Tetris pieces with 5 1×1 blocks.  By that I mean, I’m going to spend some time doing that in Excel, and the next thing you’ll see is that completed.

So there you go.  That might have seemed quick for you, right?

I worked each piece separately to make sure I was covering all bases, and didn’t create duplicates within the same base piece (from the four 1×1 level).  I shaded pieces red if they duplicated a piece already made by a previous base piece.  That leaves us with 18 unique pieces, which we can see a bit clearer in this picture.

There you go.  Someone, make this game.
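
For anyone who’d rather not do this by hand in Excel, here’s a brute-force sketch that grows pieces one block at a time and throws away rotated duplicates.  It is not how I built the pictures above; it’s just a cross-check on the counts, under the Tetris rule that rotations count as the same piece but mirror images don’t.  All the function names are made up for this example.

```python
def normalize(cells):
    """Shift cells so the smallest x and y are both 0."""
    min_x = min(x for x, y in cells)
    min_y = min(y for x, y in cells)
    return frozenset((x - min_x, y - min_y) for x, y in cells)


def canonical(cells):
    """Pick one representative among the four rotations (no reflections)."""
    forms = []
    current = cells
    for _ in range(4):
        # Rotate every cell by 90 degrees, then slide back to the origin.
        current = frozenset((y, -x) for x, y in current)
        forms.append(tuple(sorted(normalize(current))))
    return min(forms)


def pieces(n):
    """All distinct n-block pieces, treating rotations as the same piece."""
    shapes = {canonical(frozenset({(0, 0)}))}
    for _ in range(n - 1):
        grown = set()
        for shape in shapes:
            for x, y in shape:
                # Try adding a block on each face of each existing block.
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    cell = (x + dx, y + dy)
                    if cell not in shape:
                        grown.add(canonical(frozenset(shape) | {cell}))
        shapes = grown
    return shapes


for n in range(1, 6):
    print(n, "blocks:", len(pieces(n)), "pieces")
```

The set does the job of my red shading: a piece that was already produced from another base piece simply isn’t added twice.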

There are two things that stand out to me.

First off, there are a lot more pieces, but still only one straight piece.  Since it simply is what it is (all the 1×1 blocks you have to work with, in a line), there’s only ever going to be one while the rest of the pieces keep expanding.  If you thought you were waiting a long time for a straight piece in the four 1×1 version you’d hate a version like this.

You could look at it this way:

1×1 blocks    I pieces    Percent of whole
1             1 of 1      100.0%
2             1 of 1      100.0%
3             1 of 2      50.0%
4             1 of 7      14.3%
5             1 of 18     5.6%

That’s only going to get worse, though I’m not going to go past five 1×1 blocks given how much more work the six 1×1 case is going to be.

The second thing that stands out to me, though, is how close some of these pieces look to actual letters.  There are also 18 of them, and while some of the duplicates fail the rotation test in Tetris, they would look distinct enough on paper.

In fact, we’d only need eight reflections to get up to 26, which is a pretty interesting number because at that point we can build a substitution cipher with the English language.  So…let’s do that.

I didn’t really expect to end up with a substitution cipher when I started this post, but I suppose time makes fools of us all.

The nice thing about this is that each letter can fit in a 4×4 square, allowing for even further ciphering by expressing those letters not as letters but as a string of numbers.  Off the top of my head you could easily do that one of two ways.

The first would be expressing a letter as a string of numbers based on which squares were filled in.  In that way, ‘a’ would be 6/9/10/14/15, or 69101415.  The information can be parsed out without explicit spacing due to the lack of any letters in the cipher using the ‘1’ square.  If there’s a 1 in the number it means that it – and the next number following – are part of a two digit number.  In that way you can easily break 69101415 into 6, 9, 10, 14, 15.

The numbers also don’t need to be in order, so the letter would be just as preserved written as 10149156.
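
Here’s a small sketch of that first scheme, mostly to show that the “a 1 always starts a two digit number” rule really is enough to split the string back up.  It assumes, as described above, that no letter ever uses square 1; encode and decode are names I made up for the example.

```python
def encode(squares):
    """Join the filled square numbers into one digit string, no separators."""
    return "".join(str(s) for s in squares)


def decode(digits):
    """Split the digit string back into square numbers.

    Relies on square 1 never being used, so a '1' always starts a
    two-digit number (10 through 16).
    """
    squares = []
    i = 0
    while i < len(digits):
        width = 2 if digits[i] == "1" else 1
        squares.append(int(digits[i:i + width]))
        i += width
    return squares


print(encode([6, 9, 10, 14, 15]))   # the letter 'a' from above
print(decode("69101415"))
print(sorted(decode("10149156")))   # same letter, squares in a different order
```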

Each letter could also be written as a 16 digit binary number (realistically less if we figured out which squares were never used).  In that way, ‘a’ would be 0000010011000110.  If you read my post on binary math you know we could easily translate that number into base 10 and come up with 1,222.
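
The binary version is even shorter to sketch.  This assumes the squares are numbered 1 to 16 reading left to right, so square 1 is the leftmost digit; to_binary is a made-up helper name.

```python
def to_binary(squares, size=16):
    """Build a string of 0s and 1s with a 1 for each filled square.

    Squares are numbered from 1, and square 1 is the leftmost digit.
    """
    return "".join("1" if n in squares else "0" for n in range(1, size + 1))


bits = to_binary({6, 9, 10, 14, 15})  # the letter 'a' from above
print(bits)
print(int(bits, 2))  # the same number written in base 10
```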

I think I’ll revisit this one in a few weeks, as I think there’s some other stuff we could do with this.  For now, though: