50 Good Toy Problems in Data Science

Here is a long list of good toy problems to choose from:

1) Organize Twitter users into groups based on the similarity of their tweets. To get started, you can use simple metrics such as the number of words in a tweet, average word length, standard deviation of word length, etc. Use a simple classifier or clustering algorithm of your choice (e.g. see the chapter on Naive Bayes text classification here: http://nlp.stanford.edu/IR-book/…).
You can use the Twitter Streaming API, as suggested by Neil Kodner, to extract users’ status updates, and the Enron email classification methods suggested by Josh Wills. Run this on at least 1 GB of tweets (you can extract that much in less than a day unless you’re on a dial-up connection) and see whether your algorithm scales well. Extract more features with standard NLP methods (see How does one determine similarity between people online?) and try to improve your classifier’s performance. It would be interesting to see how your groupings compare to Twitter’s ‘Similar Users’ suggestions or TunkRank.
Update from the Data 2.0 Conference: full Firehose access is now available (10,000 keyword filters for 30 cents/hr): http://www.readwriteweb.com/arch…
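
As a starting point for problem 1, here is a minimal sketch of the simple metrics mentioned above (word count, average word length, standard deviation of word length), averaged per user and fed into a small k-means clustering. It is only an illustration: the function names, the `{user: [tweets]}` input layout, and the choice of k-means are assumptions, not something the original answer prescribes, and nothing here talks to the Twitter API.

```python
import statistics


def tweet_features(text):
    """Return (word count, mean word length, std dev of word length) for one tweet."""
    lengths = [len(word) for word in text.split()]
    if not lengths:
        return (0.0, 0.0, 0.0)
    return (float(len(lengths)), statistics.fmean(lengths), statistics.pstdev(lengths))


def user_features(tweets_by_user):
    """Average the per-tweet features over each user's tweets.

    Assumes a hypothetical layout {user: [tweet, ...]} with at least one tweet per user.
    """
    result = {}
    for user, tweets in tweets_by_user.items():
        rows = [tweet_features(t) for t in tweets]
        result[user] = tuple(statistics.fmean(col) for col in zip(*rows))
    return result


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means on {name: vector}.

    Seeds the centroids with the first k names in sorted order, so the result is
    deterministic but depends on that seeding (a simplification for this sketch).
    """
    names = sorted(points)
    centroids = [points[n] for n in names[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assign every user to the nearest centroid.
        assignment = {
            n: min(range(k), key=lambda c: squared_distance(points[n], centroids[c]))
            for n in names
        }
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [points[n] for n in names if assignment[n] == c]
            if members:
                centroids[c] = tuple(statistics.fmean(col) for col in zip(*members))
    return assignment


if __name__ == "__main__":
    sample = {
        "alice": ["hi yo", "ok go"],
        "bob": ["extraordinarily comprehensive documentation regarding infrastructure today"],
    }
    print(kmeans(user_features(sample), k=2))
```

The three features live on different scales, so normalizing them before clustering is a natural next step once the sketch is run on real tweets.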

2) Find similar users on Delicious (product) as suggested by Andreas Stuhlmüller: http://www.aiplayground.org/arti…

3) Explore “Where can I find large datasets open to the public?”, “What data APIs or sources should be in my O’Reilly guide?”, and http://www.reddit.com/r/datasets/

4) FAQ extraction from mailing lists, see http://mail-archives.apache.org/…

5) Find similar Quora Users by Interests and Segments: see What interesting statistics could be computed from user statistics on Quora?

6) Run some stats on Facebook or Google Profiles. See Pete Warden’s and Paul Butler’s exercises: http://petewarden.typepad.com/se… , http://petewarden.typepad.com/se… , http://paulbutler.org/archives/v…

7) Coupons: http://paulbutler.org/archives/g…

8) http://www.heritagehealthprize.com/

9) What are some good learning projects to teach oneself about machine learning?

10) Kinect: Are there any cool hacks for Kinect?

11) A better spelling corrector: http://norvig.com/spell-correct….

12A) Linear A: See Kim Raymoure’s answer: What are some computational methods used in Linear A decipherment?

12B) Linear B: Quollaboration: Toy Data Analysis for Linear B

13) A murder mystery: http://www.networkworld.com/comm…

14) Michael E Driscoll’s answer to What are some good summer programs for PhD students interested in data science?

15) Object tracking: http://info.ee.surrey.ac.uk/Pers…

16) http://datavizchallenge.org/

17) List the directors that have directed at least 20 movies and acted in all of them, using Internet Movie Database (IMDb) data: http://www.imdb.com/interfaces , http://imdbpy.sourceforge.net/
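
The query in problem 17 boils down to a set comparison per person. The sketch below assumes the credits have already been parsed into hypothetical `(movie, person, role)` tuples with roles `"director"` and `"actor"`; the real IMDb data files and IMDbPY objects are laid out differently, so treat this only as an outline of the logic.

```python
from collections import defaultdict


def directors_acting_in_all(credits, min_movies=20):
    """Return people who directed at least `min_movies` movies and acted in every one.

    `credits` is an iterable of (movie, person, role) tuples; this record layout
    is an assumption made for illustration, not the IMDb file format.
    """
    directed = defaultdict(set)
    acted = defaultdict(set)
    for movie, person, role in credits:
        if role == "director":
            directed[person].add(movie)
        elif role == "actor":
            acted[person].add(movie)
    return sorted(
        person
        for person, movies in directed.items()
        # movies <= ... checks that every directed movie is also an acted-in movie.
        if len(movies) >= min_movies and movies <= acted.get(person, set())
    )


if __name__ == "__main__":
    toy_credits = [
        ("Movie A", "Director One", "director"),
        ("Movie A", "Director One", "actor"),
        ("Movie B", "Director One", "director"),
        ("Movie B", "Director One", "actor"),
        ("Movie C", "Director Two", "director"),
    ]
    print(directors_acting_in_all(toy_credits, min_movies=2))
```

Using sets also collapses duplicate credit rows for the same movie, which tends to matter with scraped or multi-file data.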

18) Mashups: http://www.housingmaps.com/ , What data APIs or sources should be in my O’Reilly guide?

19) http://www.hearstchallenge.com/

20) What are some good class projects for machine learning using MapReduce?

21) Videolectures.net recommendations: http://www.r-bloggers.com/videol…

22) Materials identification: http://tunedit.org/challenge/mat…

23) http://www.executablepapers.com/ Also see: What kind of collaboration tools would reduce duplication of R&D effort in data analysis and sharing?

24) http://overstockreclabprize.com/

25) Data mining competitions: http://www.kaggle.com/ and http://www.kdnuggets.com/dataset…

26) IEEE Vast: http://hcil.cs.umd.edu/localphp/…

27) The Mendeley API: http://dev.mendeley.com/ , http://dev.mendeley.com/datachal…

28) HIV Progression: http://www.kaggle.com/c/hivprogr…

29) Data.gov apps: What are the best apps built on top of open government data?

30) HN search API: http://news.ycombinator.com/item…

31) Optimizing FX Trading Strategies: http://gociop.de/gecco-2011-indu…

32) Yahoo KDD cup: http://kddcup.yahoo.com

33) Analysis of Financial Data with Perl: http://perlmonks.org/index.pl?no…

34) Wide Finder challenge: http://www.tbray.org/ongoing/Whe…

35) Internet Search: http://himmele.blogspot.com/2011…

36) Life Tech: http://www.lifetechnologies.com/…

37) Downloadable patents to play with: http://www.google.com/googlebook…

38) Toy machine learning exercises: http://stackoverflow.com/questio…

39) Assignments in CS 194-16 course at Berkeley: http://datascienc.es/schedule/

40) Topcoder, USPTO and NASA $50k data mining contest: http://community.topcoder.com/nt…

41) Mathworks contests: http://www.mathworks.com/academi…

42) A data mining web app: https://github.com/entaroadun/hn…

43) KDD CUP: http://www.kdd.org/kddcup/

44) What are the best algorithms for classifying the language of a text snippet? Why?

45) Tokenising the visible English text of Common Crawl:
http://matpalm.com/blog/2011/12/…

46) Build a MixRank clone: mixrank.com

47) Kaggle gesture challenge: http://www.kaggle.com/c/GestureC…

48) Yandex Relevance Prediction Challenge: http://imat-relpred.yandex.ru/en (via KDnuggets: http://www.kdnuggets.com/2011/11… via Jeff Dalton http://twitter.com/#!/JeffD)

49) Hit prediction: http://www.wired.com/underwire/2…

50) Find Facebook Users on Match.com by Using Face Recognition Tools:
http://artemyankov.com/post/1830…

51) Reddit recommender: http://www.reddit.com/r/redditde…

A note about the author: Alex Kamil is an engineer and data scientist, and among the most popular writers on data science topics on Quora.

This answer was originally published on Quora by Alex Kamil.
