Yahoo at Hadoop Summit, San Jose 2014

By Sumeet Singh, Sr. Director, Product Management, Hadoop

Yahoo and Hortonworks are pleased to host the 7th Annual Hadoop Summit - the leading conference for the Apache Hadoop community - on June 3-5, 2014 in San Jose, California.


Yahoo is a major open source contributor to Apache Hadoop and one of its largest users.  The Hadoop project is at the heart of many of Yahoo’s important business processes, and we continue to strengthen the Hadoop ecosystem by working closely with key collaborators in the community to bring more users and projects to it.

Join us at one of the following sessions or stop by Kiosk P9 at the Hadoop Summit to get an in-depth look at Yahoo’s Hadoop culture.


Hadoop Intelligence – Scalable Machine Learning

Amotz Maimon (@AmotzM) – Chief Architect

"This talk will cover how Yahoo is leveraging Hadoop to solve complex computational problems with a large, cross-product feature set that needs to be computed quickly.  We will share challenges we face, the approaches that we’re taking to address them, and how Hadoop can be used to support these types of operations at massive scale."

Track: Hadoop Driven Business

Day 1 (12:05 PM). Data Discovery on Hadoop – Realizing the Full Potential of Your Data

Thiruvel Thirumoolan (@thiruvel) – Principal Engineer

Sumeet Singh (@sumeetksingh) – Sr. Director of Product Management

"The talk describes an approach to managing data (location, schema knowledge and evolution, sharing and ad hoc access with business-rules-based access control, and audit and compliance requirements) with an Apache Hive-based solution (Hive, HCatalog, and HiveServer2)."

Day 1 (4:35 PM). Video Transcoding on Hadoop

Shital Mehta (@smcal75) – Architect, Video Platform

Kishore Angani (@kishore_angani) – Principal Engineer, Video Platform

"The talk describes the motivation, design, and challenges faced while building a cloud-based transcoding service (that processes all the videos before they go online) and how a batch processing infrastructure has been used in innovative ways to build a transactional system requiring predictable response times."

Track: Committer

Day 1 (2:35 PM). Multi-tenant Storm Service on Hadoop Grid

Bobby Evans – Principal Engineer, Apache Hadoop PMC, Storm PPMC, Spark Committer

Andy Feng (@afeng76) – Distinguished Architect, Apache Storm PPMC

"Multi-tenancy and security are foundational to building scalable hosted platforms, and we have done exactly that with Apache Storm.  The talk describes our enhancements to Storm that have allowed us to build one of the largest installations of Storm in the world and to offer low-latency big data platform services to all of Yahoo on common Storm clusters while sharing infrastructure components with our Hadoop platform."

Day 2 (1:45 PM). Pig on Tez – Low Latency ETL with Big Data

Daniel Dai (@daijy) – Member of Technical Staff, Hortonworks, Apache Pig PMC

Rohini Palaniswamy (@rohini_aditya) – Principal Engineer, Apache Pig PMC and Oozie Committer

"Pig on Tez aims to make ETL faster by using Tez as the execution engine, as it is a more natural fit for the query plan produced by Pig.  With optimized and shorter query plan graphs, Pig on Tez delivers huge performance improvements by executing the entire script within one YARN application as a single DAG and avoiding intermediate storage in HDFS. It also employs many other optimizations made feasible by the Tez framework."

Track: Deployment and Operations

Day 1 (3:25 PM). Collection of Small Tips on Further Stabilizing your Hadoop Cluster

Koji Noguchi (@kojinoguchi) – Apache Hadoop and Pig Committer

"For the first time, the maestro shares his pearls of wisdom in a public forum. Call Koji and he will tell you if you have a slow node, misconfigured node, CPU-eating jobs, or HDFS-wasting users even in the middle of the night when he pretends he is sleeping."

Day 2 (12:05 PM). Hive on Apache Tez: Benchmarked at Yahoo! Scale

Mithun Radhakrishnan (@mithunrk) – Apache HCatalog Committer

"At Yahoo, we’d like our low-latency use-cases to be handled within the same framework as our larger queries, if viable.  We’ve spent several months benchmarking various versions of Hive (including 0.13 on Tez), file-formats, and compression and query techniques, at scale.  Here, we present our tests, results and conclusions, alongside suggestions for real-world performance tuning."

Track: Future of Hadoop

Day 1 (4:35 PM). Pig on Storm

Kapil Gupta – Principal Engineer, Cloud Platforms

Mridul Jain (@mridul_jain) – Senior Principal Engineer, Cloud Platforms

"In this talk, we propose PIG as the primary language for expressing real-time stream processing logic and provide a working prototype on Storm.  We also illustrate how legacy code written for MR in PIG can run on Storm with minimal to no changes.  Finally, we propose a “Hybrid Mode” where a single PIG script can express logic for both real-time streaming and batch jobs."

Day 2 (11:15 AM). Hadoop Rolling Upgrades - Taking Availability to the Next Level

Suresh Srinivas (@suresh_m_s) – Co-founder and Architect, Hortonworks, Apache Hadoop PMC

Jason Lowe – Senior Principal Engineer, Apache Hadoop PMC

"No more maintenance downtimes, coordinating with users, catch-up processing etc. for Hadoop upgrades.  The talk will describe the challenges with getting to transparent rolling upgrades, and discuss how these challenges are being addressed in both YARN and HDFS."

Day 3 (11:50 AM). Spark-on-YARN - Empower Spark Applications on Hadoop Cluster

Thomas Graves – Principal Engineer, Apache Hadoop PMC and Apache Spark Committer

Andy Feng (@afeng76) – Distinguished Architect, Apache Storm PPMC

"In this talk, we will cover an effort to empower Spark applications via Spark-on-YARN. Spark-on-YARN enables Spark clusters and applications to be deployed onto your existing Hadoop hardware (without creating a separate cluster). Spark applications can then directly access Hadoop datasets on HDFS."

Track: Data Science

Day 2 (11:15 AM). Interactive Analytics in Human Time - Lightning Fast Analytics using a Combination of Hadoop and In-memory Computation Engines at Yahoo

Supreeth Rao (@supreeth_) – Technical Yahoo, Ads and Data Team

Sunil Gupta (@_skgupta) – Technical Yahoo, Ads and Data Team

"Providing interactive analytics over all of Yahoo’s advertising data across the numerous dimensions and metrics that span advertising has been a huge challenge. From getting results back from a concurrent system in under a second, to computing non-additive cardinality estimations, to audience segmentation analytics, the problem space is computationally expensive and has resulted in large systems in the past. We have attempted to solve this problem in many different ways, with systems built using everything from traditional RDBMSs to NoSQL stores to commercially licensed distributed stores. With our current implementation, we look into how we have evolved a data tech stack that includes Hadoop and in-memory technologies."

Track: Hadoop for Business Apps

Day 3 (11:00 AM). Costing Your Big Data Operations

Sumeet Singh (@sumeetksingh) – Sr. Director of Product Management

Amrit Lal (@Amritasshwar) – Product Manager, Hadoop and Big Data

"As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations. Our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. We will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively."

For public inquiries or to learn more about the opportunities with the Hadoop team at Yahoo, reach out to us at bigdata AT yahoo-inc DOT com.