Big Data FAQs


Frequently Asked Questions About Big Data


What is Big Data?

Big Data refers to data sets whose size is beyond the ability of current software tools to capture, manage, and process within a reasonable time.


Who are some Big Data users?

From cloud companies like Amazon to healthcare companies to financial firms, it seems as if everyone is developing a strategy to use big data. For example, every mobile phone user has a monthly bill which catalogs every call and every text; processing the sheer volume of that data can be challenging. Software logs, remote sensing technologies, and information-sensing mobile devices all pose a challenge in terms of the volume of data they create. What counts as Big Data can be relative to the size of the enterprise: for some organizations, a few hundred gigabytes may be enough to force a rethink of their data-management options; for others, it may take tens or hundreds of terabytes.


What is Hadoop?

The Apache Hadoop software library allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale from single servers to thousands of machines, each offering local computation and storage. Instead of relying on hardware to deliver high availability, the library itself detects and handles failures at the application layer, delivering a highly available service on top of a cluster of computers in which individual machines may fail. For more information, see the Hadoop FAQ.
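To illustrate the simple programming model mentioned above, here is a minimal, single-process sketch of MapReduce-style word counting in Python. It only shows the map/shuffle/reduce contract each phase obeys; real Hadoop distributes these phases across a cluster and is typically programmed in Java.

```python
from collections import defaultdict

def map_phase(document):
    # Emit one (key, value) pair per word occurrence: ("word", 1).
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Combine all values emitted for a key into a single result.
    return key, sum(values)

def word_count(documents):
    pairs = [p for doc in documents for p in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

For example, `word_count(["big data", "big clusters"])` returns `{"big": 2, "data": 1, "clusters": 1}`.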


Is there an easy way to migrate data from Hadoop into a relational database?

Yes. The Hadoop JDBC driver can be used to pull data out of Hadoop, and the DataDirect JDBC driver can then be used to bulk load that data into Oracle, DB2, SQL Server, Sybase, and other relational databases. DataDirect bulk load provides an optimized, scalable option for moving data efficiently, accelerating throughput and maximizing performance.
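The extract-and-bulk-load pattern can be sketched as follows, using Python's built-in sqlite3 module purely as a stand-in target. The table name, columns, and batch size are illustrative assumptions; a real pipeline would read from Hadoop via its JDBC driver and write through a DataDirect driver to Oracle, DB2, SQL Server, or Sybase.

```python
import sqlite3

def bulk_load(conn, rows, batch_size=1000):
    """Insert rows in large batches instead of one statement per row."""
    cur = conn.cursor()
    # Hypothetical destination table for the example.
    cur.execute("CREATE TABLE IF NOT EXISTS call_records (caller TEXT, minutes REAL)")
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            # One round trip per batch keeps per-row overhead low.
            cur.executemany("INSERT INTO call_records VALUES (?, ?)", batch)
            batch.clear()
    if batch:
        cur.executemany("INSERT INTO call_records VALUES (?, ?)", batch)
    conn.commit()
```

Batching is the essential idea: vendor bulk-load APIs go further by bypassing the SQL layer entirely, but the interface a program sees is the same "stream rows in, commit in bulk" shape.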


When loading the results of a Big Data reduction into a relational database with indexing, we see very slow performance because of the large index. How can we make this more manageable?

The load operation is updating the index while you're loading; the key is to make sure you're not indexing while loading, because doing both at once causes too many collisions and slows the whole process down. DataDirect Bulk Load offers many options, one of which allows you to postpone indexing until after the load completes, avoiding the collisions that occur when both operations are done together and keeping your loading and indexing as fast as you expect them to be.
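The "load first, index afterwards" strategy can be sketched like this, again using sqlite3 only for illustration; with DataDirect Bulk Load this ordering is handled through the driver's own options, and the table and index names below are made up for the example.

```python
import sqlite3

def load_then_index(conn, rows):
    cur = conn.cursor()
    cur.execute("CREATE TABLE results (k TEXT, v INTEGER)")
    # 1. Load with no index in place, so each insert stays cheap.
    cur.executemany("INSERT INTO results VALUES (?, ?)", rows)
    # 2. Build the index once, after all rows are in, instead of
    #    maintaining it row by row during the load.
    cur.execute("CREATE INDEX idx_results_k ON results (k)")
    conn.commit()
```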


What are some of the use cases where bulk load can be used to handle the huge volumes of Big Data?

  • Data Warehousing – Bulk Load delivers the best performance for loading bulk data into an Oracle, DB2, Sybase, or SQL Server-based data warehouse – while avoiding data latency issues.
  • Data Migration – Bulk Load is ideal for extract and load data migration operations – whether simple or more complex.
  • Data Replication – Rather than using FTP or pushing files around a network, Bulk Load functionality allows data to be loaded quickly into relational database tables. This approach is faster and stores the data in relational tables that reporting or Business Intelligence applications can easily access.
  • Disaster Recovery – Bulk Load can ensure that the data is quickly and easily replicated into disaster recovery databases.
  • Cloud Data Publication – Bulk Load allows developers to quickly and easily build a simple program that publishes Big Data into the cloud in a way which uses resources efficiently.


Do any data integrity checks or constraints need to be removed from, or added to, the target database when using bulk load?

When a bulk load begins, the driver sends an initial packet over the socket indicating that a bulk load is starting; this removes integrity constraints on the target for the duration of the load. With DataDirect Bulk Load, there is also the option to validate the data first, which helps you gauge the probability of a successful load before attempting it. The driver checks the metadata: the metadata of the source data is compared with that of the destination table.
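A pre-load metadata check of the kind described above might look like the following sketch, using sqlite3 for illustration. The table and column names are hypothetical, and DataDirect's actual validation is internal to the driver; the point is simply to compare the incoming data's columns with the destination's before attempting the load.

```python
import sqlite3

def validate_metadata(conn, table, source_columns):
    """Return (ok, missing_columns) for a proposed load into `table`."""
    # PRAGMA table_info lists the destination table's columns;
    # row[1] is the column name. Table names cannot be bound as
    # parameters, so `table` is interpolated directly here.
    cur = conn.execute("PRAGMA table_info(%s)" % table)
    dest_columns = {row[1] for row in cur.fetchall()}
    missing = [c for c in source_columns if c not in dest_columns]
    return len(missing) == 0, missing
```

A caller would abort the bulk load (or adjust the target schema) when `ok` is false, rather than discovering the mismatch partway through a multi-gigabyte load.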