I came to coding from the traditional research world (environmental science). I started out building models in R and Python, later moved into scientific computing with SciPy, and am now dabbling in Julia. These days I focus mostly on machine learning, text mining and natural language processing, and automation in application development and deployment.
Table 1: Long version
| Skill set | Tool set |
|---|---|
| **Data Management/Information Systems** | |
| SQL | PostgreSQL, MySQL, MS SQL, Sybase |
| NoSQL | MongoDB, SPARQL, Redis |
| Ontologies and Taxonomies | RDF/OWL, Protege, SPARQL, Apache Jena |
| **Software development** | |
| Unix environments | shell (bash, zsh), system admin tools |
| Rapid prototyping | R Shiny |
| General programming languages | Python, Scala, Go, Rust |
| Backend development | Django, Flask, Node.js |
| **Analytics and Prediction** | |
| Data munging, cleaning, processing | NumPy, Pandas, R's tidyverse |
| Text mining and Natural Language Processing | Python (NLTK, scikit-learn), R (tm, quanteda) |
| Machine Learning | scikit-learn, TensorFlow, PyTorch |
| Probability and Inference | hypothesis testing, time series, probability modeling, forecasting, resampling methods, Bayesian methods |
| Simulation | Arena, SimPy |
| Optimization | linear and integer programming, calculus, numerical methods |
| Experimental design | |
| Visualizations and dashboards | Tableau, R Shiny, ggplot, plotly, Kibana |
| **DevOps** | |
| Version control and Collaboration | Git, GitHub, Jira |
| Continuous Integration/Delivery | GoCD, Jenkins |
| Configuration/Cluster Management | Ansible, Vagrant, Docker, Kubernetes, Mesos |
| Build Tools | Make, Ant, Maven, Gradle |
| Monitoring | Elasticsearch, Logstash, Kibana (ELK), Icinga |
Table 2: Short version
| Data Management | Software/Web Development | Advanced Analytics |
|---|---|---|
| SQL (PostgreSQL, MySQL) | *nix environments, bash | Text mining and NLP |
| NoSQL (MongoDB, RDF/SPARQL) | Python, R, SAS | Machine learning |
| Data wrangling and cleaning | Pipelines (GoCD) | Classical and Bayesian probability and inference |
| Database/schema design and management | Shiny for rapid prototyping | Simulation |
| Ontologies and taxonomies | Django web framework | Optimization |
| | Agile development process | Experimental design |
| | Git version control | Visualizations, dashboards |
Rust Books: List of books on Rust (programming language)
MOOCs for datascience: List of Massive Open Online Courses (MOOCs) related to Data Science from several sources
Install from requirements: A simple R function that works like Python’s `pip install -r requirements.txt`
OReilly Data Show: Copy of the RSS feed for the O’Reilly Data Show podcast to get around a firewall issue with the official feed