Data science is a synthesis of several well-established fields, so the skills that make for a successful data scientist come from a variety of disciplines, including statistics, mathematics, and computer science. Navigating a pathway through developing skills in all of these fields can be challenging. To help provide pathways through data science skill development, I've compiled a list of resources for building and expanding data science knowledge:
There is a vast array of tools that can be used for solving problems in data science. Some are programming languages or environments; others are useful packages for solving specific problems or for communicating and visualizing your results.
Almost any programming language can be used to solve computational problems, although a few stand out for their built-in packages and user support communities. Most notably, Python and R excel in these respects and are also freely available.
Python. My personal preference is Python. Python is a powerful, general-purpose, dynamic programming language that has extensive packages for scientific computation (NumPy, SciPy, pandas), advanced plotting (Matplotlib), and machine learning (scikit-learn). For this sort of scientific computing, an integrated development environment (IDE) greatly enhances productivity, and the most versatile tool in this space is VS Code. An alternative IDE is Spyder.
R. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and macOS. With the RStudio integrated development environment (IDE), the language can be powerfully wielded for rapid analyses. Additionally, R Shiny can turn R analyses into interactive web applications.
MATLAB. A numerical computing environment and programming language with a wide set of standard toolboxes including those for statistics and machine learning.
Julia. A newer programming language designed to meet the needs of mathematical computing.
Almost any data science project worth doing involves many revisions and substantial collaboration. These tools allow for comprehensive Git-based version control with a web-based repository. GitHub is the most popular, but all offer similar web-based repository services.
Git. Open source distributed version control system. Git is often used with a web-based Git repository hosting service such as GitHub.
Jupyter Notebook. This web application allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
GitHub Pages / GitHub.io. GitHub Pages allows you to create a web page from a GitHub repository and convert plain text into a formatted web document.
D3.js. D3 (or Data Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Since this is based in JavaScript, visualizations are entirely customizable, but do require significant skill to use effectively.
Tableau. Proprietary desktop and web-based visualization tools that include many data visualization techniques for the rapid development of professional visualizations.
MySQL. An open source relational database management system using SQL.
Apache Hadoop. An open source framework for distributed file storage and processing (often associated with “big data”) that uses the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for data processing.
MongoDB. A document-oriented NoSQL database (non-relational database, which does not rely on tables for storing data) capable of handling a wider variety of data types than traditional SQL relational databases.
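To make the MapReduce idea mentioned for Hadoop more concrete, here is a minimal single-machine sketch of a word count in plain Python. It is an illustration of the pattern only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical, and a real cluster would distribute each phase across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the grouped values for one key into a total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, standing in for files stored on a cluster.
documents = ["big data is big", "data science uses data"]
counts = word_count(documents)
print(counts)
```

Because each call to `map_phase` depends only on its own document and each call to `reduce_phase` depends only on its own key, both steps could in principle run in parallel, which is the property that frameworks of this kind rely on.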
Recommendations are indicated with a star.
Title | Author(s) | Topic | Year | |
---|---|---|---|---|
Computational and Inferential Thinking | Adhikari, Ani and John DeNero | Machine Learning | 2021 | |
Basic Probability Theory | Ash, Robert B. | Probability and Statistics | 1970 | |
Git Tutorial | Atlassian | Programming - Version Control | Unknown | |
Bayesian Reasoning and Machine Learning | Barber, David | Machine Learning - Bayesian Methods | 2020 | |
Code First Machine Learning | Batra, Nipun | Machine Learning | 2022 | |
A First Course in Linear Algebra | Beezer, Robert Arnold | Mathematics - Linear Algebra | 2015 | |
Pattern Recognition and Machine Learning | Bishop, Christopher | Machine Learning | 2006 | |
Introduction to Probability | Blitzstein, Joseph and Jessica Hwang | Probability and Statistics | 2019 | |
Open Data Science Masters Curriculum | Corthell, Clare | Data Science as a Field | 2015 | |
Git Internals | Chacon, Scott | Programming - Version Control | 2008 | |
Reproducible Data Science with Python | Danchev, Valentin | Programming | 2021 | |
A Course in Machine Learning | Daumé III, Hal | Machine Learning | 2015 | |
Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference | Davidson-Pilon, Cameron | Machine Learning - Bayesian Methods | 2015 | |
Mathematics for Machine Learning | Deisenroth, Marc Peter, A Aldo Faisal, and Cheng Soon Ong | Mathematics | 2019 | |
Composing Programs | DeNero, John | Programming - Python | Unknown | |
OpenIntro Statistics | Diez, David, Christopher Barr, and Mine Cetinkaya-Rundel | Probability and Statistics | 2015 | |
50 Years of Data Science | Donoho, David | Data Science as a Field | 2015 | |
Think Bayes: Bayesian Statistics Made Simple | Downey, Allen | Machine Learning - Bayesian Methods | 2013 | |
Think Complexity: Complexity Science and Computational Modeling | Downey, Allen | Programming - Python | 2012 | |
Think Python: How to Think Like a Computer Scientist | Downey, Allen | Programming - Python | 2015 | |
Think Stats: Probability and Statistics for Programmers | Downey, Allen | Probability and Statistics | 2014 | |
The Elements of Statistical Learning (2nd Edition) | Friedman, Jerome, Trevor Hastie, and Robert Tibshirani | Machine Learning | 2009 | |
Bayesian Optimization | Garnett, Roman | Machine Learning - Bayesian Methods | 2022 | |
Hands-On Machine Learning with Scikit-Learn and TensorFlow | Géron, Aurélien | Machine Learning | 2017 | |
Deep Learning | Goodfellow, Ian, Yoshua Bengio, and Aaron Courville | Machine Learning - Deep Learning and Neural Networks | 2016 | |
Open Geo Tutorial | Gray, Patrick | Geospatial | 2022 | |
Grinstead and Snell’s Introduction to Probability | Grinstead, Charles and James Snell | Probability and Statistics | 2006 | |
Calculus 1, 2, and 3. 3rd Edition | Hartman, Gregory | Mathematics - Calculus | 2015 | |
Data Visualization: A Practical Introduction | Healy, Kieran | Visualization - Design | 2018 | |
Causal Inference: What If | Hernan, Miguel and James Robins | Other | 2019 | |
Forecasting: Principles and Practice (3rd Ed.) | Hyndman, Rob and Athanasopoulos, George | Machine Learning | 2021 | |
An Introduction to Statistical Learning | James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani | Machine Learning | 2021 | |
Data Science at the Command Line | Janssens, Jeroen | Other | 2019 | |
A Brief Introduction to Neural Networks | Kriesel, David | Machine Learning - Deep Learning and Neural Networks | 2007 | |
Principles and Techniques of Data Science | Lau, Sam, Joey Gonzalez, and Deb Nolan | Machine Learning | Unknown | |
Mining of Massive Datasets | Leskovec, Jure, Anand Rajaraman, and Jeffrey Ullman | Machine Learning | 2014 | |
Data-Intensive Text Processing with MapReduce | Lin, Jimmy, and Chris Dyer | Programming - MapReduce | 2010 | |
Machine Learning - A First Course for Engineers and Scientists | Lindholm, Andreas, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön | Machine Learning | 2022 | |
Information Theory, Inference and Learning Algorithms | MacKay, David | Probability and Statistics - Information Theory | 2003 | |
D3 Tips and Tricks | Maclean, Malcolm | Visualization - D3 | 2013 | |
Calculus 1, 2, and 3. 2nd Edition | Marsden, Jerrold and Alan Weinstein | Mathematics - Calculus | 1985 | |
Python for Data Analysis, 3E | McKinney, Wes | Programming - Python | 2022 | |
Foundations of Machine Learning | Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar | Machine Learning | 2018 | |
Interpretable Machine Learning | Molnar, Christoph | Machine Learning | 2019 | |
Machine Learning: a Probabilistic Perspective | Murphy, Kevin | Machine Learning | 2012 | |
Probabilistic Machine Learning: An Introduction | Murphy, Kevin | Machine Learning | 2022 | |
Probabilistic Machine Learning: Advanced Topics | Murphy, Kevin | Machine Learning | 2023 | |
Interactive Data Visualization for the Web | Murray, Scott | Visualization - D3 | 2013 | |
Deep Learning Tutorial | Ng, Andrew | Machine Learning - Deep Learning and Neural Networks | Unknown | |
Neural Networks and Deep Learning | Nielsen, Michael | Machine Learning - Deep Learning and Neural Networks | 2016 | |
The Quest for Artificial Intelligence: A History of Ideas and Achievements | Nilsson, Nils | Data Science as a Field | 2010 | |
The Not So Short Introduction to LaTeX 2ε | Oetiker, Tobias | Other | 2016 | |
The Python Tutorial | Python Software Foundation | Programming - Python | 2017 | |
Python Machine Learning, Second Edition | Raschka, Sebastian | Machine Learning | 2017 | |
Gaussian Processes for Machine Learning | Rasmussen, Carl Edward, and Christopher Williams | Probability and Statistics - Gaussian Processes | 2006 | |
The Hitchhiker’s Guide to Python | Reitz, Kenneth and Tanya Schlusser | Programming - Python | 2016 | |
From Python to Numpy | Rougier, Nicholas | Programming - Python | 2017 | |
Scientific Visualization: Python + Matplotlib | Rougier, Nicholas | Visualization - Python | 2021 | |
Python and OpenGL for Scientific Visualisation | Rougier, Nicholas | Visualization - Python | 2018 | |
Python for Informatics | Severance, Charles | Programming - Python | Unknown | |
Understanding Machine Learning: From Theory to Algorithms | Shalev-Shwartz, Shai and Shai Ben-David | Machine Learning | 2014 | |
Advanced Data Analysis from an Elementary Point of View | Shalizi, Cosma | Machine Learning | Unknown | |
Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code | Shaw, Zed | Programming - Python | 2013 | |
Immersive Linear Algebra | Ström, Jacob, Kalle Åström, and Tomas Akenine-Möller | Mathematics - Linear Algebra | 2019 | |
Reinforcement Learning: An Introduction | Sutton, Richard, and Andrew Barto | Machine Learning - Reinforcement Learning | 2018 | |
Automate the Boring Stuff with Python: Practical Programming for Total Beginners | Sweigart, Al | Programming - Python | 2016 | |
Linear Algebra Done Wrong | Treil, Sergei | Mathematics - Linear Algebra | 2004 | |
Introduction to Applied Linear Algebra | Vandenberghe, L. | Mathematics - Linear Algebra | 2017 | |
Python Data Science Handbook | VanderPlas, Jake | Programming - Python | 2016 | |
Scipy Lecture Notes | Varoquaux et al. | Programming - Python | 2017 | |
Fundamentals of Data Visualization | Wilke, Claus | Visualization | 2018 | |
A Visual Introduction to Machine Learning | Yee, Stephanie, and Tony Chu | Machine Learning | Unknown | |
Dive Into Deep Learning | Zhang, Aston, Zachary C. Lipton, Mu Li, and Alexander J. Smola | Machine Learning - Deep Learning and Neural Networks | 2022 |
Recommendations are indicated with a star.
Title | Instructor | Designation | University | Year | |
---|---|---|---|---|---|
A Mathematics Course for Political and Social Researchers | Siegel | None | Duke University | 2014 | |
Artificial Intelligence | Winston | OpenCourseware | Massachusetts Institute of Technology | 2010 | |
Computational Linear Algebra for Coders | Thomas | None | University of San Francisco | 2017 | |
Computational Statistics in Python | Chan | STA 663 | Duke University | 2017 | |
Course: computer vision | Tomasi | COMPSCI 527 | Duke University | 2019 | |
Course: machine learning, Sebastian Raschka | Raschka | STAT 479 | University of Wisconsin-Madison | 2018 | |
Data 100: Principles and Techniques of Data Science | Hug | None | University of California, Berkeley | 2022 | |
Data 8: Foundations of Data Science | DeNero | Data 8 | University of California, Berkeley | 2022 | |
Deep Learning for Computer Vision | Li | CS 231n | Stanford University | 2022 | |
Deep Learning for Computer Vision | Johnson | EECS 498/598 | University of Michigan | 2022 | |
Deep Reinforcement Learning | Levine | CS 285 | University of California, Berkeley | 2021 | |
Fast.ai Code-First Intro to Natural Language Processing | Thomas | Fast.ai NLP | University of San Francisco | 2019 | |
Foundations of Machine Learning | Rosenberg | None | Bloomberg | Unknown | |
Full Stack Deep Learning | - | None | Multiple | 2021 | |
Introduction to Data Science | Little | CS 5963 | University of Utah | 2022 | |
Introduction to Data Science for Public Policy | Chen | PPOL 670 | Georgetown University | 2018 | |
Introduction to Deep Learning | Smola | STAT 157 | University of California, Berkeley | 2019 | |
Introduction to Deep Learning | Raschka | None | University of Wisconsin-Madison | 2021 | |
Learning From Data | Abu-Mostafa | MOOC | California Institute of Technology | 2010 | |
Machine Learning | Ng | CS 229 | Stanford University | Unknown | |
Machine Learning | Bloem | None | Vrije Universiteit Amsterdam | 2019 | |
Machine Learning | Raschka | None | University of Wisconsin-Madison | 2021 | |
Machine Learning | Batra | None | IIT Gandhinagar | 2022 | |
Mining Massive Data Sets | Leskovec | CS 246 | Stanford University | 2019 | |
Missing Semester of Your CS Education | Athalye | None | Massachusetts Institute of Technology | 2020 | |
Natural Language Processing with Deep Learning | Manning | CS 224n | Stanford University | 2022 | |
Practical Deep Learning for Coders, v3 | Howard | Fast.ai v3 | Fast.ai | 2019 | |
Practical Programming Python | Beazley | None | Unknown | 2020 | |
Principles of Machine Learning | Bradbury | IDS 705 | Duke University | 2022 | |
Probabilistic Machine Learning | Batra | None | IIT Gandhinagar | 2022 | |
Reinforcement Learning | Silver | COMPM 050 | University College London | 2015 | |
Reinforcement Learning Lecture Series | DeepMind | None | University College London | 2021 | |
Statistical Computing for Scientists and Engineers | Zabaras | None | University of Notre Dame | 2018 | |
Structure and Interpretation of Computer Programs | DeNero | CS 61A | University of California, Berkeley | 2019 | |
Theories of Deep Learning | Donoho | STATS 385 | Stanford University | 2017 |
Name | Topic | Description |
---|---|---|
Altair | Visualization | Beautiful declarative statistical visualization library for Python |
Anaconda Python Distribution | Python | Distribution for Python with package manager |
Aquarel | Visualization | Matplotlib plotting themes |
Arxiv Sanity Preserver | Research Searching | Time-saving tool for searching arxiv.org |
arXiv.org | Research Searching | Open access scholarly e-prints |
Authorea | Collaborative Writing | Online scientific document collaboration |
Binder | Programming | Have a repository full of Jupyter notebooks? With Binder, open those notebooks in an executable environment, making your code immediately reproducible by anyone, anywhere. |
Bokeh | Python | Interactive plotting tools |
cmder | Command Line | Console emulator for Windows |
Colorgorical | Color Palette Generator | Online color palette generator |
Comet | Experiment tracking | Experiment tracking, dataset versioning, and model management |
CommonMark | Markdown Language | Markdown Language |
Comprehensive Python CheatSheet | Python | Extensive Python cheat sheet |
Connected Papers | Research Searching | Explore connected papers in a visual graph |
Convnet.js | Machine Learning | Simple deep learning package for javascript |
Cookie Cutter Data Science | Programming | A logical, reasonably standardized, but flexible project structure for doing and sharing data science work. |
Copy via SSH | Web Publishing | Simple GitHub Action to copy a folder or single file to a remote server using SSH. |
D3.js | Interactive visualization | D3 (or Data Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Since this is based in JavaScript, visualizations are entirely customizable, but do require significant skill to use effectively. |
Dask | Python | Parallelize any Python code with Dask Futures, letting you scale any function and for loop, and giving you control and power in any situation |
Data Wrapper | Visualization | Create beautiful charts |
Deepnote | Collaborative Writing | Collaborate on Jupyter-like notebooks with enhanced visualization, collaboration, and integration features |
Dillinger.io | Markdown Language | Markdown editor |
Draw.io | Graphics | Online graphics platform |
Explain Shell | Command Line | Search command lines to see the help text that matches each argument |
Fabric | Python | Command line automation tool |
Fast Pages | Web Publishing | An easy to use blogging platform with extra features for Jupyter Notebooks |
Fast.ai | Machine Learning | The fastai library simplifies training fast and accurate neural nets using modern best practices |
Git | Version Control | Open source distributed version control system - the de facto standard |
GitHub | Version Control | Web hosting for Git repositories |
GitHub Pages | Web Publishing | Host web pages from GitHub repositories |
GitHub Repo Badges | Programming | GitHub repo badges |
Google AI Hub | Machine Learning | A centralized repository for developers and data scientists building artificial intelligence (AI) systems |
Google Colab | Compute | Colab allows you to write and execute Python code in your browser with free access to GPUs |
Google Earth Engine | Geospatial data | Geospatial data analysis tool |
Google Scholar | Research Searching | Search tool for scholarly literature |
Google Scholar Top Publications | Publication Rankings | Searchable list of conference and journal rankings by field |
Google Seedbank | Machine Learning | Machine learning idea generation tool |
Google Style Guide | Programming | Style guide for Python, R, Shell, HTML, CSS, Javascript, Java, and C++ |
Ground Work | Labelling | A free image labeling tool for creating custom training datasets from satellite imagery |
Guild.ai | Experiment tracking | Run, track, and compare machine learning experiments |
Hackmd.io | Markdown Language | Markdown editor |
Jupyter Notebook | Programming | This application allows you to create and share documents that contain live code, equations, visualizations and explanatory text |
Keras | Machine Learning | High-level neural network Python library enabling fast experimentation |
LaTeX for PowerPoint | Math Editor | Use LaTeX to write and edit math in PowerPoint |
Mathcha.io | Math Editor | Online math editor that is what-you-see-is-what-you-get; also allows LaTeX syntax |
Matplotlib Cheat Sheets | Visualization | Nearly-comprehensive matplotlib cheatsheet |
Mode | SQL | Simple SQL query and visualization tool |
Observable | Visualization | Collaboratively explore, analyze, visualize, and communicate with data on the web |
Oh my zsh | Command Line | Command line themes and plugins, better tab completion and more |
Open source license guide | License | A guide to choosing an open source license |
OpenAI Gym | Reinforcement Learning | A toolkit for developing and comparing reinforcement learning algorithms |
OpenAI Universe | Reinforcement Learning | A toolkit for developing and comparing reinforcement learning algorithms, particularly video games |
OpenMapFlow | Geospatial data | Rapid map creation with machine learning and earth observation data. |
Our World in Data Grapher | Visualization | Open source tools for data visualization |
Overleaf | Collaborative Writing | Online LaTeX collaboration |
Papers with Code | Research Searching | Free and open resource with Machine Learning papers, code, datasets, methods and evaluation tables |
Plot.ly for Python | Python | Interactive plotting tools |
PyFormat | Python | Explanation of formatting in Python |
PyTorch | Machine Learning | Open source deep machine learning framework |
Raster Vision | Geospatial data | Raster Vision is an open source Python framework for building computer vision models on satellite, aerial, and other large imagery sets |
Regexr | Regular Expressions | Interactive regular expression playground |
Reinforce JS | Machine Learning | Reinforcement learning package for Javascript |
Rodeo | Python | A Python integrated development environment |
SACRED | Experiment tracking | Sacred is a tool to help you configure, organize, log and reproduce experiments |
Scimago Journal Rankings | Publication Rankings | Searchable list of conference and journal rankings by field |
Scrapy | Web Scraping | Scrape data from the web |
Scrollama | Interactive visualization | Scrollers for interactive web visualizations |
Semantic Sanity | Research Searching | Time-saving tool for searching arxiv.org |
Semantic Scholar | Research Searching | AI-backed search engine for scientific journals |
Snorkel | Machine Learning | Programmatically building and managing training data |
So You Want to Build A Scroller | Interactive visualization | Scrollers for interactive web visualizations |
Solaris | Geospatial data | An open source machine learning pipeline for geospatial imagery |
Stackedit.io | Markdown Language | Markdown editor |
Style Guide for Python Code | Python | Programming style guide |
Tableau | Data visualization | Graphical user interface-based data visualization tool |
Tabula | Data Scraping | Extract data from tables |
TensorFlow | Machine Learning | Open source deep machine learning framework |
TensorFlow JS | Machine Learning | Deep learning package for JavaScript |
TensorFlow Playground | Neural Networks | Interactive neural network playground |
Tensorwatch | Debugging | Deep learning debugging tool from Microsoft |
Testbook | Programming | A unit testing framework for testing code in Jupyter Notebooks |
The Neural Network Zoo | Neural Networks | A graphical cheat sheet for neural network architectures and acronyms |
Tmux | Programming | Terminal multiplexer |
Tuna | Python | Python profiler and performance analysis tool |
Visualization Style Guide | Visualization | Urban Institute visualization style guide - great set of principles to inspire good data visualization practices |
Weights and biases | Experiment tracking | Record metrics, visualize training and share findings |
Xarray | Python | Xarray makes working with labelled multi-dimensional arrays in Python simple and efficient |
Zenodo | Reference Management | Open source data repository |
Zotero | Reference Management | Reference and citation management system for research |