What is Spark?
Apache® Spark™ is an open-source cluster computing framework whose in-memory processing can make analytic applications run up to 100 times faster than other technologies on the market today. Developed in the AMPLab at UC Berkeley, Apache Spark can help reduce data interaction complexity, increase processing speed, and enhance mission-critical applications with deep intelligence.
Highly versatile in many environments, Apache Spark is known for its ease of use in creating algorithms that harness insight from complex data. Spark was elevated to a top-level Apache Project in 2014 and continues to expand today.
IBM is committing to the Apache Spark project with investments in design-led innovation and broad-scale education programs to promote open source innovation and accelerate intelligence into every application.
The Basics & Getting Started
Remember, to be eligible for prizes, submissions must be an Apache Spark application that addresses a real business problem or core business concern related to customer care, marketing, risk management, or operations. Your Apache Spark application should be something that a business’s stakeholders could use and deploy in the future.
To meet the minimum eligibility requirements you must:
- Identify a business need that could be informed by data analysis.
- Find data (publicly available data or data from your own business) to analyze using Apache Spark to inform that business need.
- Analyze the data using Apache Spark and share your analysis code for judging (via GitHub or privately shared file). (Applications are encouraged – though not required – to be portable, with the ability to run on different cloud platforms.)
- Showcase your analysis output by including a visual (graphic) or textual explanation of your results, and a video demo explaining your process and outcomes. (Pro tip: we recommend explaining how your entry meets the judging criteria in your video demo or text description, such as the portability of your app.)
OPTIONAL: You may develop a working application that utilizes your Spark-analyzed data to help solve a business need, but this is not required to be eligible for a prize.
Business Need Examples
Not sure what business need to focus on? Your Spark application could:
- Work to support marketing by improving upon an existing “propensity to buy” model to create better offers.
- Support operations by focusing on the design strategy for optimizing shipments of raw materials or scheduling workers.
- Support risk management by monitoring user profiles to build better models for behavior patterns for commerce websites.
We get it: learning a new platform or tool can be overwhelming. You need samples and resources to make your submission great. That’s why we’ve put together this list of sample applications, services, and platforms to help you make your best Spark app!
- Download Apache Spark
- Sample Applications on HackSpark
- The Red Rock Application – Example of a Spark app that’s usable in a business setting by non-technical end users with diverse backgrounds
- Analytics for Apache Spark on IBM Bluemix
- Apache Toree - Provides applications with a mechanism to interactively and remotely access Apache Spark
- Eclairjs - Node.js API for Apache Spark with remote client
- Apache SystemML - Distributed and declarative machine learning platform
- The StackOverflow tag apache-spark is an unofficial but active forum for Spark users' questions and answers.
- Salt (http://unchartedsoftware.github.io/salt/): Built on Apache Spark, Salt generates scalable representations of billions of data points for creating visualizations, including geographic heatmaps, cross-plots or time series, and the layering of multiple data sources and dimensions for contextual overlay.
- SparkPipe (http://unchartedsoftware.github.io/sparkpipe-core/): SparkPipe is a data pipeline for Spark that facilitates the creation and reuse of modular operations and logic blocks. Operations can be chained in series or used to create complex dependency graphs.
Want to use Bluemix for your Spark App?
Typically, Bluemix services are available for free for 30 days. However, if you sign up as a participant in the Apache Spark Makers Build hackathon, you’ll have access to Bluemix services from now until November 1, 2016. Request your promo code here. (Note: Bluemix use is not required for participation in this hackathon.)
We understand that you may not have access to, or want to use, your own company’s data. With that in mind, you’re welcome to analyze publicly available data or your own. Here are some datasets to get you started. You can use one of these or any other data that is publicly available and that you have the rights to use.
NEW! Medicare cost and medical research study data from 4Quant (Data available in JSON Dataframe and raw formats; Simple DBC Scala Notebook Example also available): We invite you to investigate and discover new links between two very different datasets scraped from the medical field.
- The first dataset is cost and usage information from Medicare covering in aggregate all of the USA over many years
- The second dataset is the study and research output covering all the medical publications worldwide.
- 2.2M reviews and 591K tips by 552K users for 77K businesses
- 566K business attributes, e.g., hours, parking availability, ambience.
- Social network of 552K users for a total of 3.5M social edges.
- Aggregated check-ins over time for each of the 77K businesses
- 200,000 pictures from the included businesses
- NEW! Bitcoin network transactions (13 GB - csv): This data is the result of processing the Bitcoin blockchain as of mid-2014 through the process described here. It consists of a number of different files describing the transactions, Bitcoin addresses used, and inferred "users." Users of multiple public keys are inferred using the process here. Descriptions of the data structures are available here. To access the data, download the following files and recombine before untar/gunzip. Combine using: cat bitcoin-2014-05.tgz.part-* > bitcoin-2014-05.tgz
- NEW! Amazon Review Graph (3 GB - graphml): This dataset is a graph of Amazon product reviews from 1995 to 2003 - 2M nodes representing customers and products, and 10M edges representing reviews. The data was originally collected in 2003 by Jure Leskovec and published by the Stanford Network Analysis Project (SNAP) for research (https://snap.stanford.edu/data/). All real customer data has been removed, including the contents of each review. Customer names have been synthetically generated.
- NEW! Non-profit IRS filings on AWS - This newly released dataset from the IRS includes nonprofits’ annual Form 990 filings, which provide details on program expenses, salaries, and more. More than 60% of Form 990s are filed digitally, according to the IRS. Previously, those forms were only available as images; now the IRS is publishing them as analysis-friendly XML files. (You can also download the data in bulk from the Internet Archive, thanks to Carl Malamud, the public domain advocate who led the fight for 990s-as-XML.) One early observer noted that some of the data was misformatted and has provided instructions for fixing it. Big thanks to Jeremy Singer-Vine and his Data is Plural newsletter for letting us know about this great resource.
- World Management Survey (WMS) - WMS is an international research initiative to measure the difference in management practices across organizations and countries. (Note: You will need to register for access to this data and confirm via checkbox that you will not use the data for financial or commercial means. Devpost has confirmed with the data owners that this data may be used for this hackathon. Please contact the data owners about any other use.)
- World Bank Enterprise Surveys - World Bank Enterprise Surveys provide firm-level data from over 135,000 establishments in 135 countries. Data is used to create over 100 indicators that benchmark the quality of the business environment across the globe. Each country is surveyed every 3 to 4 years.
- OECD.Stat is the statistical online platform of the Organisation for Economic Co-operation and Development (OECD). Using this platform users can search and access OECD’s statistical databases. OECD.Stat includes data and metadata for OECD countries and selected non-member economies.
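For the Bitcoin network transactions dataset listed above, the "recombine before untar/gunzip" step can also be done from Python, which may be handy on machines without `cat` or when the download is scripted. The following is a minimal sketch, not an official loader: the function names `recombine_parts` and `extract_archive` are hypothetical, and it assumes the part files sort correctly by name, as the shell wildcard in the `cat` command implies.

```python
import glob
import shutil
import tarfile
from pathlib import Path
from typing import List


def recombine_parts(prefix: str, output: str) -> List[str]:
    """Concatenate every file whose name starts with `prefix` into `output`.

    Parts are joined in sorted name order, mirroring how the shell expands
    `cat <prefix>* > <output>`. Returns the list of part files used.
    """
    parts = sorted(glob.glob(glob.escape(prefix) + "*"))
    if not parts:
        raise FileNotFoundError("no files match " + prefix + "*")
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so multi-gigabyte parts never sit in memory.
                shutil.copyfileobj(src, out)
    return parts


def extract_archive(archive: str, dest: str) -> List[str]:
    """Extract a gzip-compressed tar archive into `dest`; return member names."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        # Only extract archives you trust: tar members can carry unsafe paths.
        tar.extractall(dest)
    return names
```

With the parts downloaded into the current directory, calling `recombine_parts("bitcoin-2014-05.tgz.part-", "bitcoin-2014-05.tgz")` followed by `extract_archive("bitcoin-2014-05.tgz", "bitcoin-2014-05")` would rebuild and unpack the archive. The destination directory name here is an assumption. The extracted CSV files can then be read by Spark.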
NEW! Weather Company APIs - For the duration of the hackathon, the Weather Company is providing all participants with access to their APIs, including alerts, forecasts, air quality reports, radar and satellite data, and more. You have the option to use this data as a complementary resource to the enterprise data that will form the core of your project. For instance, weather data can be cross-referenced with sales or consumer review data to see if there's a relationship; regional weather data can be used to assess freight operations impact, etc. Weather Company API key and documentation
- IBM Bluemix: Insights for Twitter - You could use this data to understand sentiment or popular trends. Such information could be incorporated into marketing campaigns.
- IBM Bluemix: Insights for Weather - You could use this data to help retailers plan inventory based on weather, for example by stocking up on winter items ahead of a blizzard. Weather data can also help smaller retailers plan staffing, since business might slow down when it's sunny outside or pick up during periods of bad weather.
- IBM Bluemix: Analytics Exchange - This resource includes a series of interesting datasets from a variety of sources, on topics ranging from AirBnB and tourism to science and tech.
- Awesome Public Datasets via Github
- Economic data
- Open Data Project Canada
- Open Data Project UK
- Factual -- location information: Includes a database of over 65 million local businesses and points of interest in 50 countries accessible via API or download.
- Open Data by Socrata
- National Bureau of Economic Research
- American Economic Association
- Economic Indicators and Releases
- U.S. Census Center for Economic Studies
- NBER Research papers for download
City or Regional Public Data
Still data hungry? Here are some of our favorite city, regional, and national public datasets to diversify your options.
- NYC 311 Service Request from 2010 to Present
- New York State’s Metropolitan Transportation Authority (MTA)'s Turnstile/Subway exit/entrance data
- New York State’s Metropolitan Transportation Authority (MTA)'s Historical BusTime data
- Average House Prices, Borough - London Datastore
- CitiBike (2013-2014)
- Ebola data - Worldwide Flights
- Ebola data - Non-Governmental Organizations Responding to Ebola
- Ebola data - Sub-national time series data on Ebola cases and deaths in Guinea, Liberia, Sierra Leone, Nigeria and Senegal since March 2014
- Traffic Light Data (US)
- USDA Nutritional Data
- Union Army Dataset
- NYC Open Data
- State of New York Open Data
- LA City Data
- DataSF (San Francisco)
- Data Boston
- City of Chicago Data Portal
- City of Vancouver - Open Data Catalogue
- Data BC (British Columbia)
- New York State Department of Transportation Traffic Data
- The World Bank
- United States Census Bureau
- San Francisco City Infrastructure Case Data
- New York City Social Services Daily Reports
- Chicago Red Light Camera Violations
- EPA Air Pollution Data
- US Road Traffic Data
- US Population Estimates Data (and Projections)
- USDA Adoption of Genetically Engineered Crops in the U.S.
Have a question?
We’re here to help, and we welcome your questions. Please send us a note via the Discussion Board, or email firstname.lastname@example.org.