Data Science Pathways

Data science is a synthesis of several well-established fields, so the skills that make for a successful data scientist come from a variety of disciplines, including statistics, mathematics, and computer science. Navigating a pathway through developing skills in all of these fields can be challenging. To help provide pathways through data science skill development, I've compiled a list of resources for building and expanding data science knowledge:


Computational Tools

There is a vast array of tools that can be used for solving problems in data science. Some are programming languages or environments; others are packages for solving specific problems or for communicating and visualizing your results.

Programming Languages

Almost any programming language can be used to solve computational problems, although a few stand out for their built-in packages and user support communities. Most notably, Python and R excel in these respects and are also freely available.

Python. My personal preference is Python. Python is a powerful, general-purpose, dynamic programming language with extensive packages for scientific computation (NumPy, SciPy, Pandas), advanced plotting (matplotlib), and machine learning (scikit-learn). For this sort of scientific computing, an integrated development environment (IDE) greatly enhances productivity, and the most versatile tool in this space is VS Code. An alternative IDE is Spyder.

R. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and macOS. With the RStudio integrated development environment (IDE), the language can be wielded powerfully for rapid analyses. Additionally, R Shiny can turn R analyses into interactive web applications.

MATLAB. A numerical computing environment and programming language with a wide set of standard toolboxes including those for statistics and machine learning.

Julia. A newer programming language designed to meet the needs of mathematical computing.

Version Control

Almost any data science project worth doing requires many revisions and a good deal of collaboration. Git-based version control, paired with a web-based repository host, supports both. Github is the most popular host, but the alternatives offer similar web-based repository services.

Git. Open source distributed version control system. Git is often used with a web-based Git repository hosting service such as Github.

Code Sharing and Dissemination

Jupyter Notebook. This web application allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Github Pages / Github.io. Github Pages allows you to create a web page from a Github repository and convert plain text into a formatted web document.

Visualization

D3.js. D3 (or Data Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Since this is based in JavaScript, visualizations are entirely customizable, but do require significant skill to use effectively.

Tableau. Proprietary desktop and web-based visualization tools that include many data visualization techniques for the rapid development of professional visualizations.

Database Management (for big and little data)

MySQL. An open source relational database management system using SQL.
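
As a minimal sketch of what querying a relational database with SQL looks like, the example below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, chosen so the code runs without a database server). The table name, column names, and data are hypothetical.

```python
import sqlite3


def average_score_by_course(rows):
    """Load (student, course, score) rows into a table and aggregate with SQL."""
    # In-memory SQLite database used as a stand-in for a MySQL server.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE scores (student TEXT, course TEXT, score REAL)")
        conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT course, AVG(score) FROM scores GROUP BY course ORDER BY course"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical data: three rows spread across two courses.
rows = [("ana", "stats", 90.0), ("ben", "stats", 80.0), ("ana", "ml", 70.0)]
result = average_score_by_course(rows)
print(result)
```

The same SELECT ... GROUP BY statement should carry over to MySQL largely unchanged; only the connection code would differ.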

Apache Hadoop. An open source framework for distributed file storage and processing (often associated with “big data”) that uses the Hadoop Distributed File System (HDFS) for storage and the MapReduce algorithm for data processing.
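
To make the MapReduce idea concrete, here is a small single-machine sketch of its three stages (map, shuffle, reduce) applied to word counting. It is plain Python, not Hadoop's API, and the function names are hypothetical; on a real cluster each stage would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map over every document, then shuffle and reduce the results."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["big data", "little data", "Big ideas"])
print(counts)
```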

MongoDB. A document-oriented NoSQL database (non-relational database, which does not rely on tables for storing data) capable of handling a wider variety of data types than traditional SQL relational databases.
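
The difference from a table-based store can be illustrated with a rough sketch in plain Python: each record is a self-contained document, and documents in the same collection need not share fields. This does not use MongoDB's API; the `find` helper and the sample data are hypothetical.

```python
import json

# Two documents in one collection with different, nested fields.
collection = [
    {"name": "sensor-a", "readings": [1.5, 2.0], "location": {"lat": 36.0, "lon": -78.9}},
    {"name": "survey-1", "respondent": "anonymous", "answers": {"q1": "yes"}},
]


def find(documents, **criteria):
    """Return documents whose top-level fields match all given criteria."""
    return [
        doc
        for doc in documents
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


matches = find(collection, name="sensor-a")
# Documents map naturally onto JSON, nested fields included.
serialized = json.dumps(matches[0], sort_keys=True)
print(serialized)
```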


References

Recommendations are indicated with a star.


Title Author(s) Topic Year
Computational and Inferential Thinking Adhikari, Ani and John DeNero Machine Learning 2021
Basic Probability Theory Ash, Robert B. Probability and Statistics 1970
Git Tutorial Atlassian Programming - Version Control Unknown
Bayesian Reasoning and Machine Learning Barber, David Machine Learning - Bayesian Methods 2020
Code First Machine Learning Batra, Nipun Machine Learning 2022
A First Course in Linear Algebra Beezer, Robert Arnold Mathematics - Linear Algebra 2015
Pattern Recognition and Machine Learning Bishop, Christopher Machine Learning 2006
Introduction to Probability Blitzstein, Joseph and Jessica Hwang Probability and Statistics 2019
Open Data Science Masters Curriculum Corthell, Clare Data Science as a Field 2015
Git Internals Chacon, Scott Programming - Version Control 2008
Reproducible Data Science with Python Danchev, Valentin Programming 2021
A Course in Machine Learning Daumé III, Hal Machine Learning 2015
Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference Davidson-Pilon, Cameron Machine Learning - Bayesian Methods 2015
Mathematics for Machine Learning Deisenroth, Marc Peter, A Aldo Faisal, and Cheng Soon Ong Mathematics 2019
Composing Programs DeNero, John Programming - Python Unknown
OpenIntro Statistics Diez, David, Christopher Barr, and Mine Cetinkaya-Rundel Probability and Statistics 2015
50 Years of Data Science Donoho, David Data Science as a Field 2015
Think Bayes: Bayesian Statistics Made Simple Downey, Allen Machine Learning - Bayesian Methods 2013
Think Complexity: Complexity Science and Computational Modeling Downey, Allen Programming - Python 2012
Think Python: How to Think Like a Computer Scientist Downey, Allen Programming - Python 2015
Think Stats: Probability and Statistics for Programmers Downey, Allen Probability and Statistics 2014
The Elements of Statistical Learning (2nd Edition) Friedman, Jerome, Trevor Hastie, and Robert Tibshirani Machine Learning 2009
Bayesian Optimization Garnett, Roman Machine Learning - Bayesian Methods 2022
Hands-On Machine Learning with Scikit-Learn and TensorFlow Géron, Aurélien Machine Learning 2017
Deep Learning Goodfellow, Ian, Yoshua Bengio, and Aaron Courville Machine Learning - Deep Learning and Neural Networks 2016
Open Geo Tutorial Gray, Patrick Geospatial 2022
Grinstead and Snell’s Introduction to Probability Grinstead, Charles and James Snell Probability and Statistics 2006
Calculus 1, 2, and 3. 3rd Edition Hartman, Gregory Mathematics - Calculus 2015
Data Visualization: A Practical Introduction Healy, Kieran Visualization - Design 2018
Causal Inference: What If Hernan, Miguel and James Robins Other 2019
Forecasting: Principles and Practice (3rd Ed.) Hyndman, Rob and George Athanasopoulos Machine Learning 2021
An Introduction to Statistical Learning James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani Machine Learning 2021
Data Science at the Command Line Janssens, Jeroen Other 2019
A Brief Introduction to Neural Networks Kriesel, David Machine Learning - Deep Learning and Neural Networks 2007
Principles and Techniques of Data Science Lau, Sam, Joey Gonzalez, and Deb Nolan Machine Learning Unknown
Mining of Massive Datasets Leskovec, Jure, Anand Rajaraman, and Jeffrey Ullman Machine Learning 2014
Data-Intensive Text Processing with MapReduce Lin, Jimmy, and Chris Dyer Programming - MapReduce 2010
Machine Learning - A First Course for Engineers and Scientists Lindholm, Andreas, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön Machine Learning 2022
Information Theory, Inference and Learning Algorithms MacKay, David Probability and Statistics - Information Theory 2003
D3 Tips and Tricks Maclean, Malcolm Visualization - D3 2013
Calculus 1, 2, and 3. 2nd Edition Marsden, Jerrold and Alan Weinstein Mathematics - Calculus 1985
Python for Data Analysis, 3E McKinney, Wes Programming - Python 2022
Foundations of Machine Learning Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar Machine Learning 2018
Interpretable Machine Learning Molnar, Christoph Machine Learning 2019
Machine Learning: a Probabilistic Perspective Murphy, Kevin Machine Learning 2012
Probabilistic Machine Learning: An Introduction Murphy, Kevin Machine Learning 2022
Probabilistic Machine Learning: Advanced Topics Murphy, Kevin Machine Learning 2023
Interactive Data Visualization for the Web Murray, Scott Visualization - D3 2013
Deep Learning Tutorial Ng, Andrew Machine Learning - Deep Learning and Neural Networks Unknown
Neural Networks and Deep Learning Nielsen, Michael Machine Learning - Deep Learning and Neural Networks 2016
The Quest for Artificial Intelligence: A History of Ideas and Achievements Nilsson, Nils Data Science as a Field 2010
The Not So Short Introduction to LaTeX 2ε Oetiker, Tobias Other 2016
The Python Tutorial Python Software Foundation Programming - Python 2017
Python Machine Learning, Second Edition Raschka, Sebastian Machine Learning 2017
Gaussian Processes for Machine Learning Rasmussen, Carl Edward, and Christopher Williams Probability and Statistics - Gaussian Processes 2006
The Hitchhiker’s Guide to Python Reitz, Kenneth and Tanya Schlusser Programming - Python 2016
From Python to Numpy Rougier, Nicholas Programming - Python 2017
Scientific Visualization: Python + Matplotlib Rougier, Nicholas Visualization - Python 2021
Python and OpenGL for Scientific Visualisation Rougier, Nicholas Visualization - Python 2018
Python for Informatics Severance, Charles Programming - Python Unknown
Understanding Machine Learning: From Theory to Algorithms Shalev-Shwartz, Shai and Shai Ben-David Machine Learning 2014
Advanced Data Analysis from an Elementary Point of View Shalizi, Cosma Machine Learning Unknown
Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code Shaw, Zed Programming - Python 2013
Immersive Linear Algebra Ström, Jacob, Kalle Åström, and Tomas Akenine-Möller Mathematics - Linear Algebra 2019
Reinforcement Learning: An Introduction Sutton, Richard, and Andrew Barto Machine Learning - Reinforcement Learning 2018
Automate the Boring Stuff with Python: Practical Programming for Total Beginners Sweigart, Al Programming - Python 2016
Linear Algebra Done Wrong Treil, Sergei Mathematics - Linear Algebra 2004
Introduction to Applied Linear Algebra Vandenberghe, L. Mathematics - Linear Algebra 2017
Python Data Science Handbook VanderPlas, Jake Programming - Python 2016
Scipy Lecture Notes Varoquaux et al. Programming - Python 2017
Fundamentals of Data Visualization Wilke, Claus Visualization 2018
A Visual Introduction to Machine Learning Yee, Stephanie, and Tony Chu Machine Learning Unknown
Dive Into Deep Learning Zhang, Aston, Zachary C. Lipton, Mu Li, and Alexander J. Smola Machine Learning - Deep Learning and Neural Networks 2022

Courses with Online Materials

Recommendations are indicated with a star.

Title Instructor Designation University Year
A Mathematics Course for Political and Social Researchers Siegel None Duke University 2014
Artificial Intelligence Winston OpenCourseware Massachusetts Institute of Technology 2010
Computational Linear Algebra for Coders Thomas None University of San Francisco 2017
Computational Statistics in Python Chan STA 663 Duke University 2017
Course: computer vision Tomasi COMPSCI 527 Duke University 2019
Course: machine learning, Sebastian Raschka Raschka STAT 479 University of Wisconsin-Madison 2018
Data 100: Principles and Techniques of Data Science Hug None University of California, Berkeley 2022
Data 8: Foundations of Data Science DeNero Data 8 University of California, Berkeley 2022
Deep Learning for Computer Vision Li CS 231n Stanford University 2022
Deep Learning for Computer Vision Johnson EECS 498/598 University of Michigan 2022
Deep Reinforcement Learning Levine CS 285 University of California, Berkeley 2021
Fast.ai Code-First Intro to Natural Language Processing Thomas Fast.ai NLP University of San Francisco 2019
Foundations of Machine Learning Rosenberg None Bloomberg Unknown
Full Stack Deep Learning - None Multiple 2021
Introduction to Data Science Little CS 5963 University of Utah 2022
Introduction to Data Science for Public Policy Chen PPOL 670 Georgetown University 2018
Introduction to Deep Learning Smola STAT 157 University of California, Berkeley 2019
Introduction to Deep Learning Raschka None University of Wisconsin-Madison 2021
Learning From Data Abu-Mostafa MOOC California Institute of Technology 2010
Machine Learning Ng CS 229 Stanford University Unknown
Machine Learning Bloem None Vrije Universiteit Amsterdam 2019
Machine Learning Raschka None University of Wisconsin-Madison 2021
Machine Learning Batra None IIT Gandhinagar 2022
Mining Massive Data Sets Leskovec CS 246 Stanford University 2019
Missing Semester of Your CS Education Athalye None Massachusetts Institute of Technology 2020
Natural Language Processing with Deep Learning Manning CS 224n Stanford University 2022
Practical Deep Learning for Coders, v3 Howard Fast.ai v3 Fast.ai 2019
Practical Programming Python Beazley None Unknown 2020
Principles of Machine Learning Bradbury IDS 705 Duke University 2022
Probabilistic Machine Learning Batra None IIT Gandhinagar 2022
Reinforcement Learning Silver COMPM 050 University College London 2015
Reinforcement Learning Lecture Series DeepMind None University College London 2021
Statistical Computing for Scientists and Engineers Zabaras None University of Notre Dame 2018
Structure and Interpretation of Computer Programs DeNero CS 61A University of California, Berkeley 2019
Theories of Deep Learning Donoho STATS 385 Stanford University 2017

Tools

Name Topic Description
Altair Visualization Beautiful declarative statistical visualization library for Python
Anaconda Python Distribution Python Distribution for Python with package manager
Aquarel Visualization Matplotlib plotting themes
Arxiv Sanity Preserver Research Searching Time-saving tool for searching arxiv.org
arXiv.org Research Searching Open access scholarly e-prints
Authorea Collaborative Writing Online scientific document collaboration
Binder Programming Have a repository full of Jupyter notebooks? With Binder, open those notebooks in an executable environment, making your code immediately reproducible by anyone, anywhere.
Bokeh Python Interactive plotting tools
cmder Command Line Console emulator for Windows
Colorgorical Color Palette Generator Online color palette generator
Comet Experiment tracking Experiment tracking, dataset versioning, and model management
CommonMark Markdown Language Markdown Language
Comprehensive Python CheatSheet Python Extensive Python cheat sheet
Connected Papers Research Searching Explore connected papers in a visual graph
Convnet.js Machine Learning Simple deep learning package for javascript
Cookie Cutter Data Science Programming A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.
Copy via SSH Web Publishing Simple GitHub Action to copy a folder or single file to a remote server using SSH.
D3.js Interactive visualization D3 (or Data Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Since this is based in JavaScript, visualizations are entirely customizable, but do require significant skill to use effectively
Dask Python Parallelize any Python code with Dask Futures, letting you scale any function and for loop, and giving you control and power in any situation
Data Wrapper Visualization Create beautiful charts
Deepnote Collaborative Writing Collaborate on Jupyter-like notebooks with enhanced visualization, collaboration, and integration features
Dillinger.io Markdown Language Markdown editor
Draw.io Graphics Online graphics platform
Explain Shell Command Line Search command-lines to see the help text that matches each argument
Fabric Python Command line automation tool
Fast Pages Web Publishing An easy to use blogging platform with extra features for Jupyter Notebooks
Fast.ai Machine Learning The fastai library simplifies training fast and accurate neural nets using modern best practices
Git Version Control Open source distributed version control system - the de facto standard
Github Version Control Web hosting for git repositories
Github Pages Web Publishing Host web pages from Github repositories
Github Repo Badges Programming Github repo badges
Google AI Hub Machine Learning A centralized repository for developers and data scientists building artificial intelligence (AI) systems
Google Colab Compute Colab allows you to write and execute Python code in your browser with free access to GPUs
Google Earth Engine Geospatial data Geospatial data analysis tool
Google Scholar Research Searching Search tool for scholarly literature
Google Scholar Top Publications Publication Rankings Searchable list of conference and journal rankings by field
Google Seedbank Machine Learning Machine learning idea generation tool
Google Style Guide Programming Style guide for Python, R, Shell, HTML, CSS, Javascript, Java, and C++
Ground Work Labelling A free image labeling tool for creating custom training datasets from satellite imagery
Guild.ai Experiment tracking Run, track, and compare machine learning experiments
Hackmd.io Markdown Language Markdown editor
Jupyter Notebook Programming This application allows you to create and share documents that contain live code, equations, visualizations and explanatory text
Keras Machine Learning High-level neural network Python library enabling fast experimentation
Latex for PowerPoint Math Editor Use Latex to write and edit math in PowerPoint
Mathcha.io Math Editor Online math editor that is what-you-see-is-what-you-get; also allows latex syntax
Matplotlib Cheat Sheets Visualization Nearly-comprehensive matplotlib cheatsheet
Mode SQL Simple SQL query and visualization tool
Observable Visualization Collaboratively explore, analyze, visualize, and communicate with data on the web
Oh my zsh Command Line Command line themes and plugins, better tab completion and more
Open source license guide License A guide to choosing an open source license
OpenAI Gym Reinforcement Learning A toolkit for developing and comparing reinforcement learning algorithms
OpenAI Universe Reinforcement Learning A toolkit for developing and comparing reinforcement learning algorithms, particularly video games
OpenMapFlow Geospatial data Rapid map creation with machine learning and earth observation data.
Our World in Data Grapher Visualization Open source tools for data visualization
Overleaf Collaborative Writing Online LaTeX collaboration
Papers with Code Research Searching Free and open resource with Machine Learning papers, code, datasets, methods and evaluation tables
Plot.ly for Python Python Interactive plotting tools
PyFormat Python Explanation of formatting in Python
Pytorch Machine Learning Open source deep machine learning framework
Raster Vision Geospatial data Raster Vision is an open source Python framework for building computer vision models on satellite, aerial, and other large imagery sets
Regexr Regular Expressions Interactive regular expression playground
Reinforce JS Machine Learning Reinforcement learning package for Javascript
Rodeo Python A Python integrated development environment
SACRED Experiment tracking Sacred is a tool to help you configure, organize, log and reproduce experiments
Scimago Journal Rankings Publication Rankings Searchable list of conference and journal rankings by field
Scrapy Web Scraping Scrape data from the web
Scrollama Interactive visualization Scrollers for interactive web visualizations
Semantic Sanity Research Searching Time-saving tool for searching arxiv.org
Semantic Scholar Research Searching AI-backed search engine for scientific journals
Snorkel Machine Learning Programmatically building and managing training data
So You Want to Build A Scroller Interactive visualization Scrollers for interactive web visualizations
Solaris Geospatial data An open source machine learning pipeline for geospatial imagery
Stackedit.io Markdown Language Markdown editor
Style Guide for Python Code Python Programming style guide
Tableau Data visualization Graphical user interface-based data visualization tool
Tabula Data Scraping Extract data from tables
Tensorflow Machine Learning Open source deep machine learning framework
Tensorflow JS Machine Learning Deep learning package for Javascript
Tensorflow Playground Neural Networks Interactive neural network playground
Tensorwatch Debugging Deep learning debugging tool from Microsoft
Testbook Programming A unit testing framework for testing code in Jupyter Notebooks
The Neural Network Zoo Neural Networks A graphical cheat sheet for neural network architectures and acronyms
Tmux Programming Terminal multiplexer
Tuna Python Python profiler and performance analysis tool
Visualization Style Guide Visualization Urban Institute visualization style guide - great set of principles to inspire good data visualization practices
Weights and biases Experiment tracking Record metrics, visualize training and share findings
Xarray Python Xarray makes working with labelled multi-dimensional arrays in Python simple and efficient
Zenodo Reference Management Open source data repository
Zotero Reference Management Reference and citation management system for research