Most people learn by doing rather than reading (myself included), so just pick a project and start building.
This is the journey I took:
- Set up a Hadoop cluster from scratch (start with 4 nodes on VirtualBox)
- Write software to crawl and store data on every single torrent. (I don't know why I picked torrents; it was just interesting at the time.) Pick a single topic, then scale it as far as you can.
Can I store 100,000 torrent files? Can I crawl 200 websites every 5 minutes? Can I index every single file inside each torrent? Whoops, I have 500,000,000 rows now: can I distribute that across a cluster? Can I upgrade the cluster without downtime? Can I swap Hadoop and HBase out for Cassandra? Can I do that with no downtime? Why aren't all these CPUs being utilised? How can I use Redis as a distributed cache? Now that the whole system is running, can I scale it 2x, 5x, 10x? What happens if I randomly kill a node?
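If you want a feel for the "distribute that across a cluster" and "randomly kill a node" questions before touching real machines, a toy consistent-hash ring is a cheap way to experiment. This is only a sketch under my own assumptions: the `HashRing` class, the node names and the `torrent-N` keys are all made up for illustration, and it is not how any particular database places data internally.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (md5 is used for spread, not security)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical helper, not a library API)."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (token, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node gets several tokens so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        # Simulates killing a node: drop all of its tokens from the ring.
        self._ring = [(t, n) for t, n in self._ring if n != node]

    def owner(self, key):
        """Return the node holding the first token at or after the key's hash."""
        if not self._ring:
            raise RuntimeError("ring is empty")
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


def moved_fraction(keys, before, after):
    """Fraction of keys whose owner differs between two placements."""
    moved = sum(1 for k in keys if before[k] != after[k])
    return moved / len(keys)


if __name__ == "__main__":
    ring = HashRing(["node1", "node2", "node3", "node4"])
    keys = [f"torrent-{i}" for i in range(10_000)]
    before = {k: ring.owner(k) for k in keys}
    ring.remove("node3")  # "randomly kill a node"
    after = {k: ring.owner(k) for k in keys}
    print(f"keys that changed owner: {moved_fraction(keys, before, after):.1%}")
```

The point of the exercise: only the keys that lived on the dead node should move, while everything else stays put. With a naive `hash(key) % node_count` scheme, most keys would move when the node count changes.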
Just pick a single project: astronomy data, weather data, planes in the air, open IoT sensors, IRC chat, free satellite data, Twitter streams. Pick a data source that interests you; your exercise is then to scale it as far as you can. This is an exercise in engineering, not data science or pure research. The goal is scale.
As you build this, you'll do research and discover which technologies scale better for reads or for writes, and which offer different consistency guarantees and different querying abilities.
Sure, you could read all of this, but unless you apply it, much of it won't stick.
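As one small example of what a "consistency guarantee" trade-off looks like: in quorum-replicated stores, the usual rule of thumb is that a read is guaranteed to touch at least one replica that saw the latest acknowledged write when R + W > N (N replicas, W write acks, R read acks). The sketch below brute-forces that claim for small N. The function name is my own, and it ignores real-world complications such as failures mid-write or clock skew.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every write set of size w shares a replica with every read set of size r."""
    replicas = range(n)
    return all(
        set(ws) & set(rs)  # non-empty intersection means the read sees the write
        for ws in combinations(replicas, w)
        for rs in combinations(replicas, r)
    )


if __name__ == "__main__":
    n = 3
    for w in range(1, n + 1):
        for r in range(1, n + 1):
            # Compare the brute-force answer with the R + W > N rule of thumb.
            print(f"N={n} W={w} R={r} overlap={quorums_overlap(n, w, r)} rule={r + w > n}")
```

Lower W makes writes cheaper and lower R makes reads cheaper, but you cannot lower both and keep the overlap. That is the kind of trade-off you end up discovering once your own system is under load.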
This has been my approach thus far. The place where I work allows for practical applications of this (scaling systems to millions of requests per second, running zero-downtime operations, doing BCP/DR, etc.).
Two things on this:
- Learning by doing sometimes feels like spending time discovering things that would have been obvious with the right resources to look at.
- It's still easier for the non-functional aspects of a system, where feedback is quicker. The more abstract part, to me, is the functional design of large systems.
For example, there are great resources for the more "basic" patterns (Java design patterns, Effective Java, Clean Code, etc.). Are there such resources for roles beyond SSE?