Disclaimer: This post is merely schooled speculation. All inferences were made in a matter of minutes to at most an hour without rigorous analysis.

Disclaimer #2: Warning. Contains numbers. L0ts 0f numb3rs.

Right. You’ve been warned. Let’s start.

Last week wasn’t exactly a code-less week for me. I coded last Friday, albeit only a short equation of a few lines of code, which nonetheless took me the better part of the morning to derive. It’s some sort of ranking system, but I won’t talk about that. It wasn’t exactly complicated, but I wanted to keep it as purely mathematical as possible (that is, using as few programming constructs as I could), and that’s the difficult bit. I was never really one for maths.

To cut a long story short, the equation didn’t behave as I expected and I, again, spent the better part of Monday in front of Eclipse figuring out a way to do it. In the end, on the advice of one of my module mates, I settled on using percentiles.

My ranking system divided the [0, 100] range of the percentile into 6 partitions, numbered 0 to 5. The first two partitions cover [0, 10] and (10, 20]. The remaining four partitions split the range (20, 100] equally among themselves: (20, 40], (40, 60], (60, 80], and (80, 100]. To test the function I coded, I wrote a small Python script to randomly generate a data set for me. My Java code then analyzed the data set and produced some statistics.
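For illustration, here is a minimal Python sketch of that partitioning. It is not my actual Java code; the names UPPER_BOUNDS and partition_for are made up for this example, and it assumes each partition is closed on its upper bound.

```python
from bisect import bisect_left

# Upper bound of each partition, 0 through 5 (hypothetical names).
# Partition 0 is [0, 10], partition 1 is (10, 20], and the last four
# split (20, 100] into equal 20-point bands.
UPPER_BOUNDS = [10, 20, 40, 60, 80, 100]


def partition_for(percentile):
    """Return the partition index (0..5) for a percentile in [0, 100]."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be in [0, 100]")
    # bisect_left finds the first upper bound that is >= percentile,
    # so a value sitting exactly on a bound stays in the lower partition.
    return bisect_left(UPPER_BOUNDS, percentile)


if __name__ == "__main__":
    for p in (0, 10, 10.5, 20, 40, 99.9, 100):
        print(p, "->", partition_for(p))
```

A lookup table plus a binary search keeps the boundaries in one place, which is about as close to "purely mathematical" as a few lines of Python get.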

Before I proceed, a little side story: I know that, speed-wise, Java and C rule the world, and Python is somewhere down the continuum with Ruby. I value speed and efficiency but I’m currently fanboying over Python due to its very beautiful syntax. I admit to being at times deluded by this minimalist syntax into thinking that Python is faster than “sluggish” Java, with all that boilerplate-y syntax. With my experiment today, all my delusions have been shattered. My Python script, whose sole job was to print out a random data set*, took a lot of time; I could even watch as the lines printed. Java, which had to compute percentiles, analyze results, and generate a small report, executed in a flash.

*Although, to Python’s credit, it had to generate pseudorandom numbers (duh!). Wonder how much time that takes.
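One rough way to find out is to time the number generation on its own, without any printing. This is only a sketch: the function name time_generation, the sample size, and the 0 to 100 range passed to randint are assumptions for illustration, and the timings will differ from machine to machine.

```python
import random
import time


def time_generation(n):
    """Return the seconds spent generating n pseudorandom integers."""
    start = time.perf_counter()
    # Build the list in memory only; nothing is printed inside the timer.
    values = [random.randint(0, 100) for _ in range(n)]
    elapsed = time.perf_counter() - start
    assert len(values) == n
    return elapsed


if __name__ == "__main__":
    for n in (101, 500, 100_000):
        print(f"{n} numbers: {time_generation(n):.6f} s")
```

If generation alone turns out to be quick, the slow part is more likely the printing to the console than the pseudorandom numbers.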

I generated five data sets. The first three data sets had 101 items, the last two had 500. I know that my sample is small but I’m not here to fuss over statistics, so they’ll have to do. Anyway, the results surprised me.

The first three data sets produced the same distribution and so did the last two. What’s more surprising is that the distribution seemed too clustered; I expected a bell curve but I seem to have produced only values concentrated around a certain point.

Statistics:

Data set size: 101, for 3 runs:

Partition 0: 11
Partition 1: 10
Partition 2: 20
Partition 3: 21
Partition 4: 20
Partition 5: 19

Data set size: 500, for 2 runs:

Partition 0: 51
Partition 1: 50
Partition 2: 100
Partition 3: 100
Partition 4: 100
Partition 5: 99

(Remember that both Partitions 0 and 1 are only half the size of each of the other four partitions.)

I panicked a bit at this point. Was there anything wrong with my implementation? After all, this must be the first time I’ve dealt with statistics in my code. I even asked the opinion of one of my module mates.

That’s when he pointed out that the data set was, itself, randomly generated and it is quite possible that the randomly generated data itself was already concentrated to begin with.

I used Python’s random.randint function. I skimmed through the documentation and it seems that Python’s random library has functions specifically for generating normal distributions but randint doesn’t seem to be one of them. Hmmmm?

Edit (05/10/11): I won’t be delving into this further but I just realized that I must’ve expected the bell curve in my results too much. randint must be returning a uniform distribution, not the clustered shape of a normal one. It was a poor choice from the start; I should’ve used random.gauss or random.normalvariate instead.
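A quick sketch of the difference, counting raw values (not percentiles) in ten equal-width bins. The seed, the sample size, the 0 to 99 range, and the mean of 50 with standard deviation 15 are all made-up parameters for illustration, not what my original script used.

```python
import random


def histogram(values, bins=10, low=0, high=100):
    """Count how many values fall into each equal-width bin of [low, high)."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v < high:  # values outside the range are ignored
            counts[int((v - low) // width)] += 1
    return counts


rng = random.Random(42)  # fixed seed so the run is repeatable
N = 10_000

# randint draws integers uniformly, so the bins come out roughly level.
uniform_counts = histogram([rng.randint(0, 99) for _ in range(N)])
# gauss draws from a normal distribution, so the middle bins dominate.
normal_counts = histogram([rng.gauss(50, 15) for _ in range(N)])

if __name__ == "__main__":
    for label, counts in (("randint", uniform_counts), ("gauss", normal_counts)):
        print(label)
        for i, c in enumerate(counts):
            print(f"  [{i * 10:3d}, {i * 10 + 10:3d}) {'#' * (c // 100)}")
```

The randint bars should come out at about the same length, while the gauss bars bulge in the middle.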

[CLOSED]

(Though I also wonder if it’s alright to expect a bell curve from percentiles, given my data set size? Wikipedia tells me that “In general terms, for very large populations[,] percentiles may often be represented by reference to a normal curve plot”. Maybe my data set just isn’t large enough.)
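On that last question, a small sketch suggests the data set size may not be the issue at all. If each item’s percentile is taken from its rank in sorted order, the partition counts depend only on how many items there are, not on how the values are distributed. The formula 100 * rank / n is an assumption for illustration (my Java code may have computed percentiles differently), and the names and distribution parameters are made up.

```python
import random
from bisect import bisect_left

# Upper bounds of partitions 0..5 (hypothetical name).
UPPER_BOUNDS = [10, 20, 40, 60, 80, 100]


def partition_counts(values):
    """Count items per partition using rank-based percentiles."""
    n = len(values)
    counts = [0] * len(UPPER_BOUNDS)
    # Each item's rank is its position in sorted order (ties broken arbitrarily).
    for rank, _ in enumerate(sorted(values)):
        percentile = 100 * rank / n  # assumed percentile-rank formula
        counts[bisect_left(UPPER_BOUNDS, percentile)] += 1
    return counts


rng = random.Random(0)
uniform_data = [rng.randint(0, 1000) for _ in range(500)]
normal_data = [rng.gauss(500, 100) for _ in range(500)]

uniform_result = partition_counts(uniform_data)
normal_result = partition_counts(normal_data)

if __name__ == "__main__":
    print(uniform_result)  # [51, 50, 100, 100, 100, 99]
    print(normal_result)   # [51, 50, 100, 100, 100, 99]
```

Under this assumed formula, both data sets land on the same counts, which happen to match my 500-item runs above. That would also explain why every run of the same size produced the same distribution.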