Programming languages for project analytics
September 25, 2019

If you’re late to the world of programming, there has never been a better time to start.
While a deep understanding of programming is definitely not necessary to do data analytics, familiarity with the tools available is beneficial if you are looking to take your analytics to the next level. It’s also important to understand the differences between those tools, so that if you have team members with these skill sets you can communicate with them better and strengthen stakeholder alignment within your project team. This article will hopefully give you a primer on where to begin, and some background on languages for project analytics.
As someone who has programmed for nearly 20 years, I can attest that we are now living in a world of extreme rapid prototyping supplemented by vast online assistance and communities. Many new learners can now do away with two-hundred-page textbooks, which are not digitally searchable, and turn to the web for specific problem support. A new form of textbook has also emerged via magazines: publishers are printing succinct 50-page magazines that get straight to the point for learning languages such as Python.
We are also living in a world where the barriers to entry for even traditional programming languages such as C are drastically minimized. Integrated development environments are intuitive, resources such as Stack Overflow generate responses within minutes, and Google almost always knows the exact results to display from even a vague description of your issue. One could go so far as to say that the greater skill now lies in the coder’s ability to diligently parse and home in on search results for solutions to a problem.
The rise of interpreted programming languages
Perhaps one of the biggest strides toward this new-found ease is the shift away from compiled languages – where you feed code to a compiler, it checks for errors, links libraries, and generates a binary – toward interpreted languages. Interpreted languages differ in that the source is fed to an interpreter (another piece of software), which reads and executes the code in lock step.
One of the key reasons interpreted languages are becoming more commonplace is that computers are now fast enough to interpret and execute simultaneously, and interpretation almost entirely eliminates the system-configuration, library, and compiler nuances that dissuaded most people from even trying to get into programming. It is also much easier to make small edits and instantly run a piece of code to see the results, instead of the traditional compile, run, and debug cycle. Complementing this is the rise of entire development-suite package managers. Anaconda, for instance, bundles practically all the Python libraries, debuggers, and graphical interfaces into one suite in which you can easily add or remove modules with traditional point-and-click operations.
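To make the edit-and-run loop concrete, here is a complete Python “program” that runs the instant you save it, with no compile or link step (the numbers are made up purely for illustration):

```python
# No compile/link step: change the list, re-run, and see the new result instantly.
durations = [3.2, 4.1, 5.0, 2.8]  # made-up task durations, in days
average = sum(durations) / len(durations)
print(f"average duration: {average} days")
```

Change one number and re-run; the feedback arrives in a fraction of a second, which is exactly the workflow that makes interpreted languages so approachable.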
While interpreted languages come with a performance penalty, in the world of data analytics and big data the penalty is generally hedged with faster networks, faster disk drives, more RAM, and more processor cores. Academically, throwing more hardware at slow or poorly written code is not the ideal solution, but in commercial environments the accessibility and easy sharing of interpreted code pay their own rewards. Even on standard company computer hardware, most project analytics exercises will run in reasonable time.
Compiled Code Benefits
That said, compiled languages still have extremely important roles in data analytics. A few key benefits: they are much faster (as they are already machine code), they can ship as a single compact executable, they often take up far less disk space, and they can be a little less intimidating for typical employees to run, since they can be wrapped in a GUI for the familiar point-and-click environment. A good example of where speed is critical is the Great Internet Mersenne Prime Search, which is currently trying to find the 51st Mersenne prime through a distributed network using 1.9 million CPU cores. Since 1996, only 50 such primes have been discovered, giving an indication of how difficult a problem this is. You also, of course, would not find an interpreted language running a power plant, nor would you find it in a military jet – in fact, the Department of Defense went so far as to create its own language, Ada.
Compiled vs Interpreted
Writing a C-style neural network, even using libraries, will result in several hundred extra lines of code compared to Python. C++ and C have many similarities; speaking broadly, C++ is what is referred to as object-oriented, adding to C the concept of objects: reusable, self-contained bundles of data and the actions that operate on them. A basic example is calling a deck of cards an ‘object’, on which you can invoke certain actions like ‘shuffle’, ‘draw a card’, or ‘sort the deck’. And you can have multiple decks of cards (‘objects’) in your code without having to write multiple chunks of code to represent them.
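The deck-of-cards idea can be sketched in a few lines of Python (the class and method names here are my own, purely for illustration):

```python
import random

class Deck:
    """A deck of cards as an object: data plus the actions you can invoke on it."""

    def __init__(self):
        # 52 unique (rank, suit) pairs
        self.cards = [(rank, suit)
                      for suit in ("hearts", "diamonds", "clubs", "spades")
                      for rank in range(1, 14)]

    def shuffle(self):
        random.shuffle(self.cards)

    def draw(self):
        return self.cards.pop()

    def sort(self):
        self.cards.sort()

# Two independent decks, with no duplicated code to represent them.
deck1 = Deck()
deck2 = Deck()
deck1.shuffle()
card = deck1.draw()
print(len(deck1.cards), len(deck2.cards))  # 51 52
```

Each `Deck` carries its own cards, so drawing from `deck1` leaves `deck2` untouched – which is exactly the point of bundling data with behavior.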
Java is another object-oriented language but has the distinction that it runs on a portable virtual machine; that is, Java code can run across various operating systems with no modifications. This comes at a performance penalty similar to that of Python, as you are now emulating a computer within a computer to run the Java code.
What are the key programming options for data analytics?
Currently, the IEEE and TIOBE both recognize Python (interpreted), Java (compiled), C (compiled), and C++ (compiled) as the top four overall programming languages based on various market-share indexes. Python is first on IEEE and fourth on TIOBE. Taken together, these two indexes are a very good indication of the current state of programming languages and where your knowledge is best invested.
Personally, there has never been a problem I have been unable to solve in C and the standard ANSI libraries from the command line. However, while that solves my problems, it does not necessarily solve company problems, and it has particular difficulties with team-based information sharing. I would suggest that people interested in data analytics start with Python (or R, MATLAB, Scala) and, if they develop a sincere interest in computing power and low-level coding, move into closer-to-the-metal languages such as C later in their programming journey.
What language do you recommend for data analytics?
There is never a definitive answer to what the best programming language is, and truthfully a good programmer has the wisdom to stay out of such debates. Every language has benefits and flaws – some, even with many flaws, are great languages for certain individuals because their syntax is easier to remember, or appears less intimidating as an entry point.
However, if you’re looking for the trifecta of ease of use, support networks, and employability, Python is your best bet for investing your time wisely. And once you’re familiar with Python, you’ll pick up other languages quite quickly thereafter. I always make the comparison to playing guitar: if you can manage rock music, changing to country music is often just a matter of changing tuning, rhythm, and notes – the foundations of music still persist.
When a beginner to data analytics looks at code in an interpreted language such as Python, they aren’t blasted with a wall of text: function naming conventions are intuitive, APIs are well documented, libraries are trusted, and anyone can make team-based edits and see what they can break! The caveat is the big toolset that Python code must lug with it – almost any project analytics code in Python will need imported libraries. However, with tools such as Jupyter Notebooks, which you can share with other users, you can keep library issues at bay, as your code runs on a remote server with everything installed.
To get a general idea of the libraries available, below are some of the common ones used in project and data analytics:
- Scientific computing libraries:
  - Pandas - data structures & tools
  - NumPy - arrays and matrices
  - SciPy - integrals, differential equations, optimization
- Visualization libraries:
  - Matplotlib - plots & graphs
  - Seaborn - heat maps, time series, violin plots
- Algorithmic libraries:
  - Scikit-learn - machine learning, regression
  - Statsmodels - explore data, estimate statistical models, perform tests
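As a small taste of how these pieces fit together, here is a sketch that computes a schedule-overrun summary with Pandas and NumPy (the column names and numbers are invented for illustration):

```python
import numpy as np
import pandas as pd

# Invented project data: planned vs. actual task durations, in days.
tasks = pd.DataFrame({
    "task": ["design", "build", "test"],
    "planned_days": [10, 20, 5],
    "actual_days": [12, 25, 4],
})

# Pandas handles the tabular structure; arithmetic applies column-wise.
tasks["overrun_days"] = tasks["actual_days"] - tasks["planned_days"]
mean_overrun = float(np.mean(tasks["overrun_days"]))
print(tasks)
print(f"mean overrun: {mean_overrun} days")
```

A handful of lines gives you a labeled table and a summary statistic – the equivalent C would need far more scaffolding before the first number appeared.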
Code utilizing these libraries of course relies on everyone having them installed (or the knowledge to install and import them) so the interpreter can use them; and even if you use only one or two pieces of a library, you typically import the whole thing. In compiled code, by contrast, the linker takes exactly the elements required and can statically link those functions into a single tidy executable.
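You can see this in Python itself: even a selective import still loads the entire module at run time – it only changes which names land in your namespace (here using the standard library’s statistics module as a stand-in for a larger package):

```python
# "from X import y" still executes all of module X on import;
# it only binds the single name y in the current namespace.
from statistics import mean

import sys
assert "statistics" in sys.modules  # the full module is loaded regardless

print(mean([2, 5, -1]))
```

There is no Python equivalent of a linker pruning unused functions out of the library you ship with your analysis.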
However, to share your programs effectively, the concept of Jupyter Notebooks has been established. Jupyter, when installed on a server, lets you create a “notebook” – effectively similar to a web page – in which you can host all your code alongside HTML and Markdown, and allow users to run your code segments themselves. Because the server hosting the notebooks also has Python and the associated libraries installed, the ultimate end user of your work doesn’t need to be familiar with system installations, and merely has to click a play button to run your program or refresh existing data for analysis.
If you eventually become familiar with R as well, you can move on to Apache Zeppelin notebooks, which allow you to intermix various programming languages and use the result from one language as the input to another.
So where to learn Python?
I highly suggest the IBM Data Science Professional Certificate from Coursera, having taken the program myself. It is 10 courses and can be completed within a few months, covering not only programming in Python but starting from the foundations of data collection and data cleansing, leading into data warehouses, and following with analysis, visuals, and ultimately storytelling. It is a very well-thought-out program and certainly has all the foundations for getting any beginner started properly.
Also, if you’re already familiar with Python - the course is still quite valuable; I went in a skeptic but was pleasantly surprised at how much infrastructure IBM has put under their cloud computing umbrella - they provide DB2 warehousing and Jupyter Notebook hosting for students of the program for free!
Hopefully the information here has provided some clarity and direction on where to begin when considering programming for project analytics. Feel free to comment below if you need further help, or have suggestions for other readers.