I came to coding from the traditional research world (environmental science). I started by building models in R and Python, later moved into scientific computing with SciPy, and am now dabbling in Julia. These days I focus mostly on machine learning, text mining and natural language processing, and automation in application development and deployment.
Table 1: Long version
Skill set | Tool set |
---|---|
Data Management/Information Systems |
|
SQL | PostgreSQL, MySQL, MS SQL, Sybase |
NoSQL | MongoDB, SPARQL, Redis |
Ontologies and Taxonomies | RDF/OWL, Protege, SPARQL, Apache Jena |
Software development |
|
Unix environments | shell (bash, zsh), system admin tools |
Rapid prototyping | R Shiny |
General programming languages | Python, Scala, Go, Rust |
Backend development | Django, Flask, Node.js |
Analytics and Prediction |
|
Data munging, cleaning, processing | NumPy, Pandas, R's tidyverse |
Text mining and Natural Language Processing | Python (NLTK, scikit-learn), R (tm, quanteda) |
Machine Learning | scikit-learn, TensorFlow, PyTorch |
Probability and Inference | hypothesis testing, time series, probability modeling, forecasting, resampling methods, Bayesian methods |
Simulation | Arena, SimPy |
Optimization | linear and integer programming, calculus, numerical methods |
Experimental design | |
Visualizations and dashboards | Tableau, R Shiny, ggplot, plotly, Kibana |
DevOps |
|
Version control and Collaboration | Git, GitHub, Jira |
Continuous Integration/Delivery | GoCD, Jenkins |
Configuration/Cluster Management | Ansible, Vagrant, Docker, Kubernetes, Mesos |
Build Tools | Make, Ant, Maven, Gradle |
Monitoring | Elasticsearch, Logstash, Kibana (ELK), Icinga |
Table 2: Short version
Data Management | Software/Web Development | Advanced Analytics |
---|---|---|
SQL (PostgreSQL, MySQL) | *nix environments, bash | Text mining and NLP |
NoSQL (MongoDB, RDF/SPARQL) | Python, R, SAS | Machine learning |
Data wrangling and cleaning | Pipelines (GoCD) | Classical and Bayesian probability and inference |
Database/schema design and management | Shiny for rapid prototyping | Simulation |
Ontologies and taxonomies | Django web framework | Optimization |
| | Agile development process | Experimental design |
| | git version control | Visualizations, dashboards |
Rust Books: List of books on Rust (programming language)
MOOCs for datascience: List of Massive Open Online Courses (MOOCs) related to Data Science from several sources
Install from requirements: A simple R function that works like Python’s ‘pip install -r requirements.txt’
OReilly Data Show: Copy of the RSS feed for the O’Reilly Data Show podcast, to get around a firewall issue with the official feed