Performance improvements for photo serving

We’ve been working to make Flickr faster for our users around the world. Since the primary photo storage locations are in the US, and information on the internet travels at a finite speed, the farther away a Flickr user is located from the US, the slower Flickr’s response time will be. Recently, we looked at opportunities to improve this situation. One of the improvements involves keeping temporary copies of recently viewed photos in locations nearer to users.  The other improvement aims to get a benefit from these caches even when a user views a photo that is not already in the cache.
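The first improvement can be sketched as a small least-recently-used (LRU) cache that serves repeat requests locally and falls back to the distant origin on a miss. This is a conceptual toy only; names like `fetch_from_origin` are illustrative and not Flickr's actual implementation:

```python
from collections import OrderedDict

def fetch_from_origin(photo_id):
    # Stand-in for a slow, long-distance fetch from primary US storage.
    return f"bytes-of-{photo_id}"

class EdgeCache:
    """Keeps temporary copies of recently viewed photos near users."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, photo_id):
        if photo_id in self.store:
            self.store.move_to_end(photo_id)   # refresh recency on a hit
            return self.store[photo_id], "hit"
        photo = fetch_from_origin(photo_id)    # slow path on a miss
        self.store[photo_id] = photo
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)     # evict least recently used
        return photo, "miss"
```

The second improvement described above amounts to populating such a cache even on a miss, so the next nearby viewer gets the fast path.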

It’s Official, Yahoo + Flurry!

Exploring Life Without Compass

Compass is a great thing. At Flickr, we’re actually quite smitten with it. But being conscious of your friends’ friends is important (you never know who they’ll invite to your barbecue), and we’re not so sure about this “Ruby” that Compass is always hanging out with. Then there’s Ruby’s friend Bundler who, every year at the Christmas Party, tells the same stupid story about the time the police confused him with a jewelry thief. Enough is enough! We’ve got history, Compass, but we just feel it might be time to try seeing other people. 

Changes to Flickr YQL tables, both v1 and v2

We recently announced that the Flickr API is going SSL-only.
To support this move, we have also restricted the Flickr YQL tables to SSL-only access.

All developers using the Flickr YQL tables will need to make the following updates to their API settings by June 24, 2014:

  • Protocol: HTTPS
  • Port: 443

The domain name will remain the same.

As of June 24, 2014, we will limit all access to Flickr YQL tables to secure SSL connections only. No Flickr API data will be accessible over HTTP from this date onwards. If you don’t switch the access protocol to HTTPS, your users will not be able to access Flickr data via your service.
Thank you for supporting us and our users in making the shift to HTTPS.
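As a sketch of the required change, a client that previously built request URLs over HTTP only needs the scheme (and, explicitly, the port) changed; the hostname stays the same. The helper below is illustrative, not an official Flickr utility:

```python
from urllib.parse import urlsplit, urlunsplit

def force_https(url):
    """Rewrite an http:// API URL to https:// on port 443.

    The domain is left unchanged, matching the migration notes above.
    """
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https",
                               netloc=parts.hostname + ":443")
    return urlunsplit(parts)
```

Any URL already using `https://` passes through untouched, so the helper is safe to apply unconditionally at request time.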

Go to the Flickr Developer Guide for more information.

Yahoo at Hadoop Summit, San Jose 2014

By Sumeet Singh, Sr. Director, Product Management, Hadoop

Yahoo and Hortonworks are pleased to host the 7th Annual Hadoop Summit - the leading conference for the Apache Hadoop community - on June 3-5, 2014 in San Jose, California.


Yahoo is a major open source contributor to and one of the largest users of Apache Hadoop.  The Hadoop project is at the heart of many of Yahoo’s important business processes and we continue to make the Hadoop ecosystem stronger by working closely with key collaborators in the community to drive more users and projects to the Hadoop ecosystem.

Join us at one of the following sessions or stop by Kiosk P9 at the Hadoop Summit to get an in-depth look at Yahoo’s Hadoop culture.


Hadoop Intelligence – Scalable Machine Learning

Amotz Maimon (@AmotzM) – Chief Architect

"This talk will cover how Yahoo is leveraging Hadoop to solve complex computational problems with a large, cross-product feature set that needs to be computed in a fast manner.  We will share challenges we face, the approaches that we’re taking to address them, and how Hadoop can be used to support these types of operations at massive scale."

Track: Hadoop Driven Business

Day 1 (12:05 PM). Data Discovery on Hadoop – Realizing the Full Potential of Your Data

Thiruvel Thirumoolan (@thiruvel) – Principal Engineer

Sumeet Singh (@sumeetksingh) – Sr. Director of Product Management

"The talk describes an approach to managing data (location, schema knowledge and evolution, sharing, ad hoc access with business-rule-based access control, and audit and compliance requirements) with an Apache Hive based solution (Hive, HCatalog, and HiveServer2)."

Day 1 (4:35 PM). Video Transcoding on Hadoop

Shital Mehta (@smcal75) – Architect, Video Platform

Kishore Angani (@kishore_angani) – Principal Engineer, Video Platform

"The talk describes the motivation, design and the challenges faced while building a cloud based transcoding service (that processes all the videos before they go online) and how a batch processing infrastructure has been used in innovative ways to build a transactional system requiring predictable response times."

Track: Committer

Day 1 (2:35 PM). Multi-tenant Storm Service on Hadoop Grid

Bobby Evans – Principal Engineer, Apache Hadoop PMC, Storm PPMC, Spark Committer

Andy Feng (@afeng76) – Distinguished Architect, Apache Storm PPMC

"Multi-tenancy and security are foundational to building scalable hosted platforms, and we have done exactly that with Apache Storm.  The talk describes our enhancements to Storm that have allowed us to build one of the largest installations of Storm in the world, offering low-latency big data platform services to all of Yahoo on common Storm clusters while sharing infrastructure components with our Hadoop platform."

Day 2 (1:45 PM). Pig on Tez – Low Latency ETL with Big Data

Daniel Dai (@daijy) – Member of Technical Staff, Hortonworks, Apache Pig PMC

Rohini Palaniswamy (@rohini_aditya) – Principal Engineer, Apache Pig PMC and Oozie Committer

"Pig on Tez aims to make ETL faster by using Tez as the execution engine, as it is a more natural fit for the query plan produced by Pig.  With optimized and shorter query plan graphs, Pig on Tez delivers huge performance improvements by executing the entire script within one YARN application as a single DAG and avoiding intermediate storage in HDFS. It also employs a lot of other optimizations made feasible by the Tez framework."

Track: Deployment and Operations

Day 1 (3:25 PM). Collection of Small Tips on Further Stabilizing your Hadoop Cluster

Koji Noguchi (@kojinoguchi) – Apache Hadoop and Pig Committer

"For the first time, the maestro shares his pearls of wisdom in a public forum. Call Koji and he will tell you if you have a slow node, misconfigured node, CPU-eating jobs, or HDFS-wasting users even in the middle of the night when he pretends he is sleeping."

Day 2 (12:05 PM). Hive on Apache Tez: Benchmarked at Yahoo! Scale

Mithun Radhakrishnan (@mithunrk), Apache HCatalog Committer

"At Yahoo, we’d like our low-latency use-cases to be handled within the same framework as our larger queries, if viable.  We’ve spent several months benchmarking various versions of Hive (including 0.13 on Tez), file-formats, and compression and query techniques, at scale.  Here, we present our tests, results and conclusions, alongside suggestions for real-world performance tuning."

Track: Future of Hadoop

Day 1 (4:35 PM). Pig on Storm

Kapil Gupta – Principal Engineer, Cloud Platforms

Mridul Jain (@mridul_jain) – Senior Principal Engineer, Cloud Platforms

"In this talk, we propose PIG as the primary language for expressing real-time stream processing logic and provide a working prototype on Storm.  We also illustrate how legacy code written for MR in PIG, can run with minimal to no changes, on Storm.  We also propose a “Hybrid Mode” where a single PIG script can express logic for both real-time streaming and batch jobs."

Day 2 (11:15 AM). Hadoop Rolling Upgrades - Taking Availability to the Next Level

Suresh Srinivas (@suresh_m_s) – Co-founder and Architect, Hortonworks, Apache Hadoop PMC

Jason Lowe – Senior Principal Engineer, Apache Hadoop PMC

"No more maintenance downtimes, coordinating with users, catch-up processing etc. for Hadoop upgrades.  The talk will describe the challenges with getting to transparent rolling upgrades, and discuss how these challenges are being addressed in both YARN and HDFS."

Day 3 (11:50 AM). Spark-on-YARN - Empower Spark Applications on Hadoop Cluster

Thomas Graves – Principal Engineer, Apache Hadoop PMC and Apache Spark Committer

Andy Feng (@afeng76) – Distinguished Architect, Apache Storm PPMC

"In this talk, we will cover an effort to empower Spark applications via Spark-on-YARN. Spark-on-YARN enables Spark clusters and applications to be deployed onto your existing Hadoop hardware (without creating a separate cluster). Spark applications can then directly access Hadoop datasets on HDFS."

Track: Data Science

Day 2 (11:15 AM) – Interactive Analytics in Human Time - Lighting Fast Analytics using a Combination of Hadoop and In-memory Computation Engines at Yahoo

Supreeth Rao (@supreeth_) – Technical Yahoo, Ads and Data Team

Sunil Gupta (@_skgupta) – Technical Yahoo, Ads and Data Team

"Providing interactive analytics over all of Yahoo’s advertising data, across the numerous dimensions and metrics that span advertising, has been a huge challenge. From returning results in under a second on a concurrent system, to computing non-additive cardinality estimations for audience segmentation analytics, the problem space is computationally expensive and has historically resulted in large systems. We have attempted to solve this problem in many different ways, with systems built using traditional RDBMSes, NoSQL stores, and commercially licensed distributed stores. With our current implementation, we look into how we have evolved a data tech stack that includes Hadoop and in-memory technologies."

Track: Hadoop for Business Apps

Day 3 (11:00 AM) – Costing Your Big Data Operations

Sumeet Singh (@sumeetksingh) – Sr. Director of Product Management

Amrit Lal (@Amritasshwar) – Product Manager, Hadoop and Big Data

"As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations. Our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. We will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively."

For public inquiries or to learn more about the opportunities with the Hadoop team at Yahoo, reach out to us at bigdata AT yahoo-inc DOT com.

Computer vision at scale with Hadoop and Storm

Recently, the team at Flickr has been working to improve photo search. Before our work began, Flickr only knew about photo metadata — information about the photo included in camera-generated EXIF data, plus any labels the photo owner added manually like tags, titles, and descriptions. Ironically, Flickr has never before been able to “see” what’s in the photograph itself… 

Yahoo Answers APIs Will Be Removed as of June 3, 2014

As you know, Yahoo Answers recently launched a new UI worldwide in 21 markets across Desktop & Mobile.

As part of this continued effort to improve Yahoo Answers for our users, we are focusing our energies and will no longer support the current set of Answers APIs (questionSearch, getByCategory, getQuestion, getByUser) as of June 3, 2014. In addition, the Yahoo Answers API documentation at the Yahoo Developer Network portal will also be removed.

We are working on an enhanced set of APIs and will post any future information in this space.

Questions related to this announcement can be directed to YDN General Discussion forum.

Yahoo Betting on Apache Hive, Tez, and YARN

by The Hadoop Platforms Team

Low-latency SQL queries, Business Intelligence (BI), and Data Discovery on Big Data are some of the hottest topics in the industry these days, with a range of proprietary and open-source solutions emerging on top of Hadoop to address them.  Some of the popular ones discussed in the Big Data communities are Hive, Presto, Impala, Shark, and Drill.

Hive’s Adoption at Yahoo

Yahoo has traditionally used Apache Pig, a technology developed at Yahoo in 2007, as the de facto platform for processing Big Data, accounting for well over half of all Hadoop jobs to date.  One of the primary reasons for Pig’s success at Yahoo has been its ability to express complex processing needs through feature-rich constructs and operators ideal for large-scale ETL pipelines, something that is not easy to express in SQL.  Researchers and engineers working on data systems built on Hadoop at the time found it an order of magnitude better than working with the Java MapReduce APIs directly.  Apache Pig settled in and quickly made a place for itself among developers.

Over time and with increased adoption of the Hadoop platform across Yahoo, a SQL or SQL-like solution over Hadoop became necessary for the ad hoc analytics that Pig was not well suited for.  SQL is the most widely used language for data analysis and manipulation, and Hadoop had also started to reach beyond data scientists and engineers to downstream analysts and reporting teams.  Apache Hive, originally developed at Facebook in 2007-2008, was a popular and scalable SQL-like solution available over Hadoop at the time that ran in batch mode on Hadoop’s MapReduce engine.  While Yahoo adopted Hive in 2010, its use remained limited.

On the other hand, MapReduce, Pig, and Hive, all running on top of Hadoop, raised concerns around sharing of data among applications written using these different approaches.  Pig and MapReduce’s tight coupling with the underlying data storage was also an issue in terms of managing schema and format changes.  Apache HCatalog, a table and storage management layer, was consequently conceived at Yahoo in 2010 to provide a shared schema and data model for MapReduce, Pig, and Hive by providing wrappers around Hive’s metastore.  HCatalog merged with the Hive project in 2013, but remains central to our effort to register all data on the platform in a common metastore and make it discoverable and shareable with controlled access.
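Conceptually, HCatalog's role can be pictured as a shared registry that every engine consults by table name instead of hard-coding storage paths, formats, and schemas. The toy sketch below illustrates the idea only; it is not the HCatalog API, and the table and fields are made up:

```python
# Toy shared "metastore": engines look tables up by name instead of
# each hard-coding where the data lives and how it is laid out.
METASTORE = {
    "clicks": {
        "location": "/data/clicks",            # hypothetical HDFS path
        "format": "orc",
        "schema": ["user_id", "url", "ts"],
    },
}

def describe_table(name):
    """What a Pig, Hive, or MapReduce job would ask the metastore for."""
    entry = METASTORE[name]
    return entry["schema"], entry["location"], entry["format"]
```

Because every engine resolves the same entry, a schema or format change is made once in the registry rather than in every consuming job, which is the interoperability benefit described above.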

The Need for Interactive SQL on Hadoop

By mid-2012, the need to make SQL over Hadoop more interactive became material as specific use cases and requirements emerged.  At the same time, Yahoo had undertaken a large effort to stabilize Hadoop 0.23 (the pre-2.x branch) and YARN and roll them out at scale on all our production clusters.  YARN’s value propositions were absolutely clear.  To address the interactive SQL use cases, we started exploring our options in parallel, and around the same time Project Stinger was announced as a community-driven project from Hortonworks to make Hive capable of handling a broad spectrum of SQL queries (from interactive to batch) along with extending its analytics functions and standard SQL support.  Early versions of HiveServer2 also became available to address the concurrency and security issues in connecting to Hive over standard ODBC and JDBC, which BI and reporting tools like MicroStrategy and Tableau needed.  We decided to stick with Hive and participate in its development and phased (Phases I, II, III) delivery.  Hive also happens to be one of the fastest growing products in our platform technology stack (Fig 1), confirming that SQL on Hadoop is a hot topic for good reason.

Fig 1. Growth in Hive jobs relative to overall Hadoop jobs

Why Hive?

So, why did we stick with Hive or as one may say, bet on Hive?  We did an evaluation of available solutions, and stayed the course we were on with Hive as the best solution for our users for several key reasons:

  • Hive is the SQL standard for Hadoop: it has been around for seven years, is battle-tested at scale, and is widely used across industries
  • A single solution that works across a broad spectrum of data volumes (more on this in the performance section)
  • HCatalog, part of Hive, acts as the central metastore for facilitating interoperability among various Hadoop tools
  • A vibrant community from many well-known companies, with top-notch engineers and architects vested in its future
  • Top Level Project (TLP) with the Apache Software Foundation (ASF), which offers several advantages, including our deep familiarity with ASF and the related Hadoop ecosystem projects under Apache, and the clarity around making contributions to gain influence in the community, which may allow Yahoo to evolve Hive in a direction that meets our users’ needs
  • Perhaps one of the few SQL-on-Hadoop solutions that has been widely certified by BI vendors (an important distinction, since Hive is often used directly by data analysts and reporting teams)
  • Alleviating performance concerns with relentless phased delivery (Hive 0.11, 0.12, and 0.13) against the initially stated performance goals

Query Performance on Hive 0.13

Since performance was one of our users’ biggest concerns with Hive 0.10, the version Yahoo was running, we conducted performance benchmarks on Hive, though the significant facelift in features with later versions of Hive was also important.

In one of the recent performance benchmarks Yahoo’s Hive team conducted on the January version of Hive 0.13, we found query execution times dramatically better than Hive 0.10 on a 300-node cluster.  To give you an idea of the magnitude of the difference we observed, Fig 2 shows TPC-H benchmark results on a 100 GB dataset for Hive 0.10 with the RCFile (Row Columnar) format on Hadoop 0.23 (MapReduce on YARN) vs. Hive 0.13 with the ORC (Optimized Row Columnar) format, Apache Tez on YARN, vectorization, and Hadoop 2.3.  Security was turned off in both cases.  With Hive 0.13, 18 out of 21 queries finished under 60 seconds, with the longest still under 80 seconds.  Hive 0.13 execution times were also comparable to or better than Shark on a 100-node cluster.

Fig 2. Hive 0.10 vs. Hive 0.13 on 100 GB of data

Hive 0.13 query execution times were not only significantly better at higher volumes of data (Fig 3 and 4), but the queries also executed successfully without failing.  In our comparisons with Shark, we saw most queries fail on the larger (10 TB) dataset.  These same queries ran successfully and much faster on Hive 0.13, allowing for better scale.  This was extremely critical for us, as we needed a single query and BI solution on the Hadoop grid regardless of dataset size.  The Hive solution resonates with our users, as they do not have to worry about learning multiple technologies and discerning which solution to use when.  A common solution also yields cost and operational efficiencies, since only a single solution has to be built, deployed, and maintained.

Fig 3. Hive 0.10 vs. Hive 0.13 on 1 TB of data

Fig 4. Hive 0.10 vs. Hive 0.13 on 10 TB of data

The performance of Hive 0.13 is certainly impressive over its predecessors, but one must realize how these performance improvements came about.  Several systems rely on caching data in memory to lower latency.  While this works well for some use cases, the approach fails when the data is too large to fit in memory, or in a shared multi-tenant environment where memory resources have to be shared among tenants or users.  Hive 0.13, on the other hand, achieves comparable performance through ORC, Tez, and vectorized query processing, and does not suffer from the issues noted above.  On the flip side, building solutions in this manner requires heavy engineering investment (hundreds of person-months in the case of Hive 0.13 since the start of 2013) for robustness and flexibility.
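Of the three pillars mentioned, vectorized query processing is the least self-explanatory: instead of evaluating one row at a time, each operator works on a batch of values from a single column, amortizing per-row overhead. A language-neutral toy sketch of the difference (the column name and batch size are illustrative, not Hive internals):

```python
def sum_row_at_a_time(rows):
    # Row-oriented execution: per-row overhead on every iteration.
    total = 0
    for row in rows:
        total += row["price"]
    return total

def sum_vectorized(price_column, batch_size=1024):
    # Vectorized execution: operate on whole batches of one column,
    # which is roughly what Hive 0.13 does inside each operator.
    total = 0
    for i in range(0, len(price_column), batch_size):
        total += sum(price_column[i:i + batch_size])
    return total
```

Both functions compute the same answer; the vectorized form simply moves the inner loop into a tight batch operation, which is where the speedup comes from in real column-oriented engines.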

Looking Ahead

We are excited about the work going on in the Hive community to take Hive 0.13 to the next level in subsequent releases in terms of both features and performance, in particular the Cost-based Query Optimizations and the ability to perform inserts, updates, and deletes with full ACID support.