Probability Binning: simple and fast is better than complicated and slow

Recently, I've done a few data science coding challenges for job interviews. My favorite ones included a data set and asked me to address both specific and open-ended questions about that data set.

One of the first things I usually do is make a bunch of histograms. Histograms are great because they're an easy way to look at the distribution of data without having to plot every single point, or getting distracted by a lot of noise.

How traditional histograms work:

A histogram is just a plot of the number of counts per value, where the values are divided into equally-sized bins. In a traditional histogram, the bins are always the same width along the x-axis (along the range of the values). More bins mean better resolution; fewer bins can simplify the representation of a data set, for example if you want to do clustering or classification into a few representative groups.

A histogram with ten bins:

[figure]

The same data with 3 bins:

[figure]
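
In numpy terms, the equal-width behavior is easy to see (a minimal sketch with made-up data, just to show the bin edges):

    import numpy as np

    # hypothetical data: a big cluster of small values and a small cluster of large ones
    data = np.concatenate([np.random.normal(10, 2, 900),
                           np.random.normal(50, 5, 100)])

    counts, bin_edges = np.histogram(data, bins=10)

    print(counts)               # number of values in each bin
    print(np.diff(bin_edges))   # bin widths: all equal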

Original implementation:

First, I used matplotlib to get the bin ranges, because that was easy. Then I applied those as masks on my original dataframe, to convert the data into categories based on the bin ranges.

    import matplotlib.pyplot as plt

    def feature_splitter(df, column, bins=3):
        """
        Convert continuous variables into categorical for classification.
        :param df: pandas dataframe to use
        :param column: str, name of the column to bin
        :param bins: number of bins to use, or list of boundaries if bins should be different sizes
        :return: counts (np.array), bin_ranges (np.array), histogram chart (display)
        """
        counts, bin_ranges, histogram = plt.hist(df[column], bins=bins)

        return counts, bin_ranges, histogram

    def apply_bins_as_masks(df, column, bin_ranges):
        """
        Use bin_ranges to create categorical column

        Assumes 3 bins

        :param df: pandas dataframe as reference and target
        :param column: reference column (name will be used to create new one)
        :param bin_ranges: np.array with ranges, has 1 more number than bins
        :return: modified pandas dataframe with categorical column
        """

        low = (df[column] >= bin_ranges[0]) & (df[column] < bin_ranges[1])
        med = (df[column] >= bin_ranges[1]) & (df[column] < bin_ranges[2])
        high = (df[column] >= bin_ranges[2])

        masks = [low, med, high]

        for i, mask in enumerate(masks):
            df.loc[mask, (column + '_cat')] = i

        return df
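
Usage looked roughly like this (the dataframe and column name here are made up for illustration):

    import pandas as pd

    # hypothetical data: a numeric column to split into categories
    df = pd.DataFrame({'total_clicks': [1, 2, 3, 10, 11, 12, 95, 98, 100]})

    counts, bin_ranges, _ = feature_splitter(df, 'total_clicks', bins=3)
    df = apply_bins_as_masks(df, 'total_clicks', bin_ranges)

    print(df[['total_clicks', 'total_clicks_cat']])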

This worked well enough for a first attempt, but the bins from a traditional histogram didn't always make sense for my purposes, and I was assuming that I'd always be masking with exactly 3 bin ranges.


Then I remembered that there's a different way to do it: choose the bin ranges by equalizing the number of events per bin. The bin widths might be different, but the heights are approximately the same. This is great if you have otherwise really unbalanced classes, like in this extremely simplified example, where a traditional histogram doesn't do the best job of capturing the distribution:

[figure]

When to use probability binning:

Use probability binning when you want a small number of approximately equal-sized classes, defined in a way that makes sense, e.g. combining adjacent bins if they're very similar.

It's a way to convert a numeric, non-continuous variable into categories.

For example, let's say you're looking at user data where every row is a separate user. The values of a specific column, say "Total clicks", might be numeric, but the users are independent of each other. In this case, what you really want to do is identify categories of users based on their number of clicks. This isn't continuous in the same way as a column that consists of a time series of measurements from a single user.
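
My module handles choosing and combining the bins, but if you just want a quick look at equal-count categories, pandas has a built-in that does something similar (the column and values here are hypothetical):

    import pandas as pd

    # hypothetical user data: most users click a little, a few click a lot
    df = pd.DataFrame({'total_clicks': [1, 2, 2, 3, 4, 5, 8, 20, 150, 400]})

    # three quantile-based bins: roughly equal numbers of users per category
    df['clicks_cat'] = pd.qcut(df['total_clicks'], q=3, labels=[0, 1, 2])

    print(df['clicks_cat'].value_counts())

The difference is that qcut won't combine nearly-identical adjacent bins for you, which is the part I wanted to automate.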

I used to do this by hand/by eye, which is fine if you don't need to do it very often. But this is a tool that I've found extremely useful, so I wanted to turn it into a reusable module that I could easily import into any project and apply to any column.

The code I wrote is here.

The actual process of getting there looked like this:

Step 1: create an inverted index

Step 2: write tests and make sure that's working

Step 3: use plots to verify that it was working as expected (and to compare with the original implementation)

For the simple case it was, but on further testing I realized I had to combine bins if there were too many or they were too close together.

Step 4: combine bins

Step 5: use the bin ranges to mask the original dataframe and assign category labels

    def bin_masker(self):
        """
        Use bin_ranges from probability binning to create categorical column

        Should work for any number of bins > 0

        :param self.df: pandas dataframe as reference and target
        :param self.feature: reference column name (str) - will be used to create new one
        :param self.bin_ranges: sorted list of new bins, as bin ranges [min, max]
        :return: modified pandas dataframe with categorical column
        """
        masks = []

        for item in self.bin_ranges:
            mask = (self.df[self.feature] >= item[0]) & (self.df[self.feature] < item[1])
            masks.append(mask)

        for i, mask in enumerate(masks):
            self.df.loc[mask, (self.feature + '_cat')] = i

        # anything that didn't match a mask falls into the bottom category
        self.df[self.feature + '_cat'].fillna(0, inplace=True)

Step 6: try it in the machine learning application of my choice (a decision tree - this will go in a separate post). Check the accuracy score on the train-test split (0.999, looks good enough to me).
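
The check itself was nothing fancy; a sketch of what that step looks like (the feature and label here are placeholders, not my actual data):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # hypothetical binned feature and label, just to show the shape of the check
    df = pd.DataFrame({'total_clicks_cat': [0, 0, 1, 1, 2, 2, 2, 2] * 25,
                       'label':            [0, 0, 0, 1, 1, 1, 1, 1] * 25})

    X_train, X_test, y_train, y_test = train_test_split(
        df[['total_clicks_cat']], df['label'], test_size=0.25, random_state=0)

    tree = DecisionTreeClassifier().fit(X_train, y_train)
    print(accuracy_score(y_test, tree.predict(X_test)))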

Step 7: write more tests, refactor into OOP, write more tests.

Step 8: Add type hints and switch to using a public data set and pytest. Fix some stupid bugs. Write this blog post. Start preparing a package to upload to pypi for easier portability.

Within every tutorial is another tutorial

Things I learned while following this tutorial on how to build reusable models with scikit-learn.

  1. When in doubt, go back to pandas.
  2. When in doubt, write tests.
  3. When in doubt, write helper methods to wrap existing objects, rather than creating new objects.

Ingesting "clean" data is easy, right?

Step 1 of this tutorial began with downloading data using requests, and saving that to a csv file. So I did that. I've used requests before, so I had no reason to think it wouldn't work. It looked like it worked.

Step 2 was to read the file into pandas. I've read lots of csv files into pandas before, so I had no reason to think it wouldn't work.

It didn't work.

I double-checked that I had followed the instructions correctly, and then checked a few more times before concluding that something was not quite right about the data.

I went back and did the easy thing, just printing out the response from requests.

After some digging, I figured out that response.content is not the same as response.text.

The tutorial said to use response.content, but response.text was what I actually needed: content gives you the raw bytes, while text gives you the decoded string.
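
You can see the difference just by checking the types (example.com here is only a stand-in url):

    import requests

    response = requests.get('https://example.com')

    print(type(response.content))  # <class 'bytes'> - the raw payload
    print(type(response.text))     # <class 'str'> - decoded using the response encoding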

Even with that fix, pandas was refusing to read in more than the first row of data, due to a couple of problems:

  • pandas wasn't finding the line terminators (nothing special, just '\n')
  • pandas wasn't finding equal numbers of items per row

Unexpectedly, when I went back to what I usually do (plain old pandas.read_csv, this time pointed directly at the url, with the column names passed in), it actually worked.

So it was actually better, and a lot less code, to completely skip using requests.
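
In other words, something like this (the url and column names here are placeholders, not the tutorial's actual data set):

    import pandas as pd

    # hypothetical: the raw file has no header row, so pass the column names explicitly
    names = ['col_a', 'col_b', 'col_c', 'label']

    df = pd.read_csv('https://example.com/some_dataset.csv', header=None, names=names)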


Testing always gets me unstuck

I really liked the end-to-end structure of this tutorial, and was frankly embarrassed that I had so much trouble getting the initial ingestion to work.

I liked that the tutorial gave me an excuse to walk through how the author actually uses scikit-learn models in production. With the data firmly in hand, the data visualization steps were easy - they worked as advertised, and anyway I'm very familiar with using seaborn to make charts in python.

I had never created a Bunch object before, so that was new for me. That seemed to work, but then the next steps again failed, and I had to back up a few steps.

I wasn't sure what the problem was, so I did what I always do with complicated problems, and wrote some tests to rule out user error and make sure I understood what the code was doing. That helped a lot, and identified what was actually broken.

The problem: how to apply LabelEncoder (to convert categorical data) and Imputer (to fill missing data) across multiple columns.

Because the idea was to do this in the context of a Pipeline object, the author demonstrated how to create custom Encoder and Imputer objects using multiple inheritance. I understand the goal of this: take advantage of the nice clean syntax you get from making a Pipeline. But it was failing at the fit_transform step, and it wasn't obvious why.

The fit() and transform() steps both seemed to be working individually and sequentially, and it wasn't easy to figure out how the fit_transform step was supposed to do anything more than chain them together.

After banging my head on this at the end of a long day, even going back to the original scikit-learn source code in an effort to design tests to help me figure out what was wrong, I decided to sleep on it.


Simple and working is better than complicated and broken

I seriously considered writing tests for the custom Encoder and Imputer objects, but then it dawned on me that I didn't need to. The Pipeline was doing something simple enough that I didn't really need it, so I stripped the objects down into plain functions that run the fit and transform steps, which was really all I needed anyway.
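
Something along these lines (a sketch, not my exact code; newer scikit-learn versions replaced the old Imputer with SimpleImputer):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.impute import SimpleImputer

    def encode_columns(df, columns):
        """Fit a separate LabelEncoder on each categorical column."""
        for col in columns:
            df[col] = LabelEncoder().fit_transform(df[col])
        return df

    def impute_columns(df, columns):
        """Fill missing values in the given numeric columns with the column mean."""
        imputer = SimpleImputer(strategy='mean')
        df[columns] = imputer.fit_transform(df[columns])
        return df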

That got me through the rest of the steps, so I could practice pickling a model and re-loading it, which seemed to work just fine.

I don't know if the scikit-learn folks have plans to extend these methods, or if everyone normally does these kinds of acrobatics to encode and impute on multiple columns - normally I would just use pandas for that, too.

Shuffling the deck: an interview question

Here is a story about an interesting interview question and how I approached it.

The company in question wasn't interested in actually looking at my code, since I apparently tried to answer the wrong question.

Given a deck of n unique cards, cut the deck c cards from the top and perform a perfect shuffle. A perfect shuffle is where you put down the bottom card from the top portion of the deck followed by the bottom card from the bottom portion of the deck. This is repeated until one portion is used up. The remaining cards go on top.

Determine the number of perfect shuffles before the deck returns to its original order. This can be done in any language. A successful solution will solve the problem for 1002 cards and a cut size of 101 in under a second even on a slow machine.

I looked at that and did what they tell you to do for interviews, and coding in general, especially when you don't know where to start: start with the naive, simple approach.

Step 1. make_deck

    cards = [x for x in range(1, n + 1)]

Step 2. shuffle(cards, c)

    from collections import deque

    def shuffle(cards, c):
        """
        Cut the deck at position c and perform one perfect shuffle.

        :param cards: the deck (list of ints)
        :param c: where to cut the deck (int)
        :return: newstack (deque), the deck after one shuffle
        """
        top = cards[0:c]
        bottom = cards[c:]

        stopping_criteria = min(len(top), len(bottom))

        newstack = deque()

        # lay down the bottom card of each portion in turn
        for i in range(stopping_criteria):
            newstack.append(top.pop())
            newstack.append(bottom.pop())

        if (len(top) == 0) and (len(bottom) == 0):
            return newstack

        # the remaining cards go on top (note: extendleft reverses their order)
        elif len(top) > 0:
            newstack.extendleft(top)
        elif len(bottom) > 0:
            newstack.extendleft(bottom)

        return newstack
   

Step 3. shuffle_recursive(cards, c, shuffle_count)

    def shuffle_recursive(cards, c, shuffle_count):
        """
        Shuffle until the original order is restored, and count as you go.
        Assumes for now that the original order is sequential and the first card is always 1.

        :param cards: the deck (list of ints) to pass to the shuffle function
        :param c: cut size to pass to the shuffle function (int)
        :param shuffle_count: running count of shuffles so far (int)
        :return: shuffle_count (int) once the original order is restored

        >>> shuffle_recursive([1,2,3,4,5], 3, 0)
        4
        """
        newstack = shuffle(cards, c)

        shuffle_count += 1

        # stopping criteria: the deck is back in its original order
        if list(newstack) == [x for x in range(1, len(cards) + 1)]:
            return shuffle_count

        else:
            return shuffle_recursive(list(newstack), c, shuffle_count)

So I did that, and was surprised to get a recursion depth error.

Then I realized it only works up to Python's default recursion limit (1000).

Also, it was obviously too slow.

So I did some profiling, and found that the majority of time was spent in these 3 lines:

    for i in range(stopping_criteria):
        newstack.append(top.pop())
        newstack.append(bottom.pop())

And that kind of surprised me, since I thought the whole point of deque() is that it's supposed to be faster.

So then I spent some time thinking about how I could possibly make the code go faster.

Ultimately I ended up directly creating the interleaved bottom part of the deck, and then added the top. I noticed that the tricky part was dealing with the leftover cards. I also noticed that it took a lot fewer iterations to get back to the starting order if I reversed the top cards before I put them back.

Then I hooked that up to run iteratively, so I could control the number of times it ran, for debugging, etc.
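
The iterative wrapper itself is the easy part; here's a sketch using the original shuffle function from above (the faster version plugs in the same way):

    def count_shuffles(n, c):
        """Count perfect shuffles until a deck of n cards returns to its original order."""
        original = [x for x in range(1, n + 1)]

        cards = list(shuffle(original, c))
        count = 1

        while cards != original:
            cards = list(shuffle(cards, c))
            count += 1

        return count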

The code is here if you want to see what I did.

I wrote a bunch of tests while I was doing this, like I always do, and I couldn't help noticing that there were some weird edge cases that never worked.

I tried to read some advanced math articles, which led me to understand that the weird edge cases I was seeing were probably harmonics.

Then, because I'm really a data scientist at heart, I wanted to see what that looked like.

I wrote a couple of methods to help me visualize the results.

Overall, I'd say it was a great coding challenge, really interesting and I learned a lot.

However. When I went to turn in my work, the response was less than encouraging.

I wrote:

I came up with a simple, very slow (10 second+ run-time) solution fairly quickly, and then spent 3-4x more time coming up with a 10x faster solution.

What I have right now meets the requirement for 1002 cards with cut size 101 in under a second on my mac laptop (see below - not sure what you define as a "slow machine"?).

And the reply came back:

What answer did your solution arrive at for the test case? Is it 790034? That's not correct, so if that's the case you should take another look. It should only take a tenth of a second or so.

Apparently I was so annoyed at the way this exchange ended that I deleted both my response (let's consider it redacted) and theirs. I said something about how if the point was that it was a coding exercise, maybe they'd want to see my code even if I got a different answer (I did)?

They said I should have known I wasn't supposed to try to actually make the decks based on how the question was worded.

I did not know that. I'm not sure why it's so hard to just ask a straightforward question instead of including, as part of the challenge, that I should be able to read your mind.

Anyway, they did not want to see my code.

Shortly thereafter, I asked a friend who is more of an algorithms person and he said "Oh yeah, all you do is write the equation for a single card to get back to its original position, and then you have the answer."

Of course, I found that confusing, because based on what I did, I don't think it's really that simple. I think it depends on how you do the shuffling, e.g. whether you reverse the top half when you add it back on. Which the original question said nothing about.

And some cards (as the edge cases show) will take a much longer time to get back to their original position, depending on where you cut the deck and how many shuffles you do.
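
That's also why, for any fixed shuffle implementation (mine included), the tidy way to get the full answer is to treat one shuffle as a permutation and take the least common multiple of its cycle lengths; following a single card only tells you about its own cycle. A sketch, reusing the shuffle function from above:

    from math import gcd

    def order_of_shuffle(n, c):
        """Number of shuffles before a deck of n cards repeats: the LCM of the cycle lengths."""
        original = list(range(1, n + 1))
        shuffled = list(shuffle(original, c))

        # perm[i] is where the card at position i ends up after one shuffle
        perm = [shuffled.index(card) for card in original]

        seen = set()
        answer = 1
        for start in range(n):
            pos, length = start, 0
            while pos not in seen:          # walk one cycle of the permutation
                seen.add(pos)
                pos = perm[pos]
                length += 1
            if length:
                answer = answer * length // gcd(answer, length)   # running LCM

        return answer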

So, my shuffles might be imperfect, and my ability to read interviewers' minds hasn't improved much. But hey, those harmonics are pretty interesting.
