Saturday, August 26, 2023

 

 

 

 

 

 

 

Term Research Paper

An Analysis of Hadoop and NoSQL

Richard Garling

AMU

INFO620

12/23/2018

Dr. Larson

 

 


 

Abstract

The use of Hadoop and NoSQL data applications has grown exponentially over the past decade. Apache Hadoop is a collection of open source software utilities that allows a network of computers to solve problems involving massive amounts of data and computation. There are many potential applications for Hadoop, especially in big data. Non-relational data stores have existed since the 1960s, but only recently acquired the label NoSQL. They have attracted new interest because of their adaptability to Web 2.0 applications. NoSQL provides storage and retrieval of data by means other than the tabular relations common in traditional DBMS applications. Retail transactional databases, for example, generate data with a high emphasis on the five V’s: variety, volume, veracity, velocity, and value.



 

 

 

 

 

 

 

 

 

Contents

Abstract

Introduction

Summary of References

Distributed Database Systems (DDBS)

Advantages

Transparency

Disadvantages

NoSQL

Hadoop

Application to Web Search

Information Retrieval (IR)

Search Engine

Apache Solr

Relevancy

Conclusion

References

 

 


 

Introduction

The use of Hadoop and NoSQL data applications has grown exponentially over the past decade. Apache Hadoop is a collection of open source software utilities that allows a network of computers to solve problems involving massive amounts of data and computation. There are many potential applications for Hadoop, especially in big data.

Non-relational data stores have existed since the 1960s, but only recently acquired the label NoSQL. They have attracted new interest because of their adaptability to Web 2.0 applications. NoSQL provides storage and retrieval of data by means other than the tabular relations common in traditional DBMS applications.

The question I propose to research involves the application of Hadoop and NoSQL to online retail companies such as Bradford Exchange and Hammacher Schlemmer. While Amazon would be an easy choice considering its size and breadth, I work for the two companies above, and my need to learn how these applications would work for smaller companies is greater. Using a much larger company as the example would not be an apples-to-apples comparison when analyzing how Hadoop or NoSQL would work for a smaller company. The question: Can Hadoop and NoSQL applications help increase conversions when customers search for a product on a retail eCommerce website?

The purpose of this research is to gain an understanding of on-site search engine tools that use Hadoop or NoSQL, such as Solr, and of how effective these technologies are at increasing conversions from query search on an eCommerce site. An increase in conversions could amount to millions of dollars in sales yearly.

Many retail organizations, both online and traditional brick and mortar, use transactional databases that generate huge amounts of data, putting a high emphasis on the five V’s: variety, volume, veracity, velocity, and value. Additionally, eCommerce companies, whose business has grown exponentially over the last ten years, are using systems like Hadoop and NoSQL to develop online strategies that attract customers away from traditional brick and mortar stores. Bradford Exchange and Hammacher Schlemmer are just beginning to develop systems using Solr as a search engine that utilizes Hadoop. Solr is open source software (OSS) maintained as a project of the Apache Software Foundation.

 

Summary of References

Elmasri, R., & Navathe, S. B. (2016). Fundamentals of database systems (7th ed.)

NoSQL and Hadoop are classes of systems developed to manage large amounts of data. Organizations such as Amazon, Sears, LinkedIn, Instagram, Facebook, and Twitter use NoSQL and Hadoop in applications that handle web links, profiles, tweets, and posts as well as sales and marketing data. Big data applications use Hadoop and NoSQL to mine vast amounts of data for patterns that standard DBMSs would otherwise miss. The term NoSQL is often misread as “no SQL.” It actually means “not only SQL”: NoSQL systems use other, non-traditional means of storing and retrieving data that suit the application. Most NoSQL systems are distributed databases or storage systems focusing on semi-structured data. They stress data replication, high performance, scalability, and availability. In this respect NoSQL systems are the opposite of traditional database systems, where structured data storage, powerful query languages, and data consistency are the norm.

Big data applications commonly use Hadoop. Big data refers to datasets whose size is beyond the capabilities of traditional DBMSs to capture, store, retrieve, and manipulate in any useful format. Big data system sizes are measured in terabytes, petabytes, or exabytes. Hadoop came about in the search for an open source software (OSS) search engine. It started as part of a search tool called Nutch, developed by Doug Cutting and Mike Cafarella. Nutch could crawl and index hundreds of millions of web pages. Cutting and Cafarella realized they could use the ideas presented in Google’s papers on the Google File System and MapReduce to improve their search engine. Hadoop is the result of that effort.

Elmasri and Navathe’s Fundamentals of database systems will be able to help in developing a technical understanding of how Hadoop and NoSQL work. From the technical analysis, one can gain a better understanding of the workings of these two database systems.

 Karambelkar, Hrishikesh Vijay. (2015). Scaling big data with Hadoop and Solr: understand,

            design, build, and optimize your big data search engine with Hadoop and Apache Solr

            (2nd ed.). Birmingham, England: Packt Publishing

The growth of major enterprise systems and internet properties such as eCommerce has produced the need for scalable search capable of handling huge amounts of data quickly. Apache Solr is one such OSS search engine in popular use today. It is feature-rich and scalable, is based on the Lucene search library, and is used by companies such as Sears, StubHub, and Zappos. Karambelkar’s book is intended to help build high-performance search engines for big data enterprise systems with the help of Hadoop and Solr. The book introduces its readers to Apache Hadoop and its components HDFS and MapReduce, showing how to program with each and how to build, configure, and administer clusters. The author gives readers a thorough understanding of how to use, configure, and administer Solr, as well as how to apply it with machine learning techniques. The book will prove to be a useful resource for understanding the practical application of Hadoop and Solr to search on eCommerce sites.

K V, R., & Kavya, N. (2016). Trend Analysis of E-Commerce Data using Hadoop Ecosystem. International

            Journal of Computer Applications

Trend Analysis of E-Commerce Data using Hadoop Ecosystem describes using trend analysis to estimate future events and to approximate uncertain past events. Estimating or approximating events from huge amounts of data can be a daunting task if one wants to draw anything meaningful from the data, and traditional DBMSs were proving cumbersome for gathering large amounts of data and deciphering trends from it. Apache Hadoop has been one of the best tools for querying and analyzing huge amounts of data in a distributed architecture. K V and Kavya describe the experimental work currently under way on trend analysis in big data and argue that Hadoop, using MapReduce as a parallel processing framework over very large datasets, delivers the best querying results. Their work will help in gaining a better understanding of how Hadoop and NoSQL apply in practice.

 

Verma, N., & Singh, J. (2017). An intelligent approach to Big Data analytics for the sustainable retail environment using the Apriori-MapReduce framework.  

The purpose of Verma and Singh’s study was to explore the limitations of traditional data mining systems in extracting buying trends and patterns from retail databases. These retail databases are becoming huge transactional data systems filled with the buying habits of millions of customers. Understanding this data has become big business, and tuning a system so that a company’s conversion rate rises by just 1% can mean millions of dollars in sales. The study analyzes how to draw eCommerce customers away from their computers and into the brick and mortar stores of traditional retail. The authors aim to develop algorithms using MapReduce and MR-Apriori association rules, together with a Hadoop-based intelligent cloud architecture, to address these issues efficiently. They theorize that these tools will let them derive buying trends from the data that help attract customers to local retail stores.

 

Fattal, T. (2014). Creation of an Analytics platform for an on-site eCommerce search engine.

Thomas Fattal is the co-founder of Findify (https://findify.io/), a provider of search engines for eCommerce sites. His Master’s thesis concentrates on the creation of a platform that analyzes users’ actions and behaviors in order to identify trends on which a retailer can act. Fattal’s system tracks buyers at numerous levels on a fault-tolerant and scalable architecture. It saves all logs and transfers them to a centralized system for processing. From the analysis, Findify sends a weekly report for the retailer to review, containing service level details and analytical insights that can turn into actionable items. Since Findify is currently a production system, it offers real-life examples of how a Hadoop-based system works in action. The system can receive thousands of logs per second and is scalable.

  Jaco Aucamp. (2014). eCommerce Site Search Engine - Best Practise. Unpublished. https://doi.org/10.13140/RG.2.1.2297.1042.

The purpose of Jaco Aucamp’s study was to show the importance of relevancy in eCommerce search results. Relevancy is vital to the success of an eCommerce website. Aucamp analyzes several search engines commonly in use today: Solr, Elastic, Endeca, Celebros, and SLI. Many of these search engines are built on top of Apache Lucene. The key is to deliver search results that are relevant to what the buyer seeks. Amazon and Wal-Mart have been using these tools very effectively. Companies like Bradford Exchange are constantly experimenting to find the right mix that will deliver relevant product suggestions based on the terms in the search query. Aucamp’s study provides more real-life examples of how NoSQL and Hadoop systems apply in business environments.

 

 

Distributed Database Systems (DDBS)

During the late 20th century, several distributed data systems were developed to handle problems of data distribution, replication, query and transaction processing, metadata administration, and numerous other topics. Several new technologies now combine distributed and database technologies. Many industries, including insurance and retail, are at the forefront of developing these new data technologies.

A distributed database is a collection of numerous logically interrelated databases distributed over a network with the database management system (DBMS) as the software that manages those databases. Much of how the system operates is transparent to the user.

To be called a distributed database (DDB), a system must satisfy certain conditions. It must consist of a collection of nodes connected by a computer network that transmits data and commands among the sites. A network node is a connection point that can receive, create, store, and send data along a distributed network route. There must be a logical relationship among the databases in the network. The hardware and software need not be alike across nodes; the sites can be located within the same facility on a local area network or spread around the globe on a wide area network.

Advantages

Two common advantages of distributed systems are reliability and availability. Reliability is the probability that the system is running (not down) at a given point in time. Availability is the probability that the system is continuously available during a given period. Both are directly related to the failures, errors, and faults associated with a database. A failure is a deviation of the system’s behavior from the normal, correct completion of operations. Errors are the subset of system states that cause the failure, and the fault is the cause of the error.

Fault tolerance means building mechanisms into the system that detect and remove faults before they result in a failure. This common approach starts from the recognition that faults will occur. An alternative is to apply more stringent controls so that the system does not fail in the first place; these controls involve a strict design process with extensive testing and quality control. While hardware failures are to be expected from time to time, systems can be built to mitigate them. As long as database consistency and data reliability are not compromised, a database manager can deal with failures due to transaction and hardware errors or faults. Hardware failures can be caused by main memory losses or secondary storage losses. Network failures are usually due to message or line failures.

Scalability and partition tolerance are further advantages. Scalability defines the extent to which the system can expand without interruption to operations. There are two types of scalability:

1.     Horizontal Scalability concerns the ability of the system to distribute data to old and new nodes as nodes are added.

2.     Vertical Scalability concerns the ability to expand capacity or processing power of individual nodes.

As the system expands, faults in the network may cause the nodes to partition into groups. The nodes within each partition stay connected by a sub-network, but communication between partitions may be lost. Partition tolerance means the system has the capacity to keep operating while the network is partitioned.

Other important advantages include:

·       Geographically distributed sites make application development easier because data distribution and control are transparent. The user need not know the location of data so long as it is accessible.

·       With data spread over many sites, nodes at one site may fail while the other sites continue to operate. Isolating faults at their source limits the overall effect on the system: only the data at the failed site cannot be accessed, whereas a failure in a centralized system makes the whole system inaccessible. Even while the failed site is unavailable, some of its data may still be available at other sites due to replication.

·       Performance improves because data is located at numerous sites rather than a single location. Data localization lessens the demand for CPU and I/O services at each site and reduces access delays in wide area networks. Queries and transactions tend to run more quickly and efficiently against smaller, localized databases. Executing multiple queries at different sites, or breaking a query into sub-queries that run in parallel, achieves interquery and intraquery parallelism, which further improves system performance.

·       Expansion is much easier in a distributed system than in a centralized system. Because data is already located at numerous sites, adding a new site should be easier, since the scalability built into the system means the addition should have little impact on the rest of the system. Total transparency provides database users with a view of the whole system as if it were one centralized system (Karambelkar, 2015).
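The intraquery parallelism mentioned in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a DDBMS implementation: the `SITES` dictionary, its contents, and the function names are hypothetical, and threads stand in for separate sites.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each site holds one fragment of a sales relation.
SITES = {
    "chicago": [("plate", 40.0), ("doll", 25.0)],
    "new_york": [("clock", 110.0)],
    "dallas": [("globe", 60.0), ("lamp", 15.0)],
}

def local_sum(rows):
    """Sub-query executed at one site: sum the price column locally."""
    return sum(price for _, price in rows)

def total_sales(sites):
    """Run one sub-query per site in parallel, then merge the partial sums."""
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        partials = list(pool.map(local_sum, sites.values()))
    return sum(partials)

print(total_sales(SITES))
```

Each site scans only its own fragment, and only the small partial results travel to the coordinator that merges them.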

Transparency

Data distribution transparency (also known as network or organization transparency) frees the user from knowing the operational details of where data is located or how the network is laid out in a DDB. It has two parts: location transparency and naming transparency.

Location transparency means that the command used to perform a task is independent of the location of the data and of the node that issued the command.

Naming transparency means that once an object is named, it can be accessed unambiguously without knowledge of its location in the network.

Replication transparency means that copies of the same data may be stored at numerous locations for better availability and faster access. The user is unaware of the number of copies of the data and most likely doesn’t care.

In a distributed database system, decisions must be made about which sites store which portions of the data. Before deciding where to store the data, the designer must determine the logical units of the database that will be distributed. The simplest logical units are the relations themselves, with each whole relation stored at a specific site. Many times, the data is stored at the site nearest the users who access it most. Horizontal fragmentation, or sharding, is useful for partitioning a relation so that each department holds the rows it needs.

Fragmentation comes in two types: horizontal (also known as sharding) and vertical. Fragmentation transparency hides both from the user. Horizontal fragmentation distributes a relation (a table) into sub-relations, each a subset of the tuples (rows) of the original relation. Newer systems refer to this as sharding, the horizontal partitioning of data in a database or search engine. Sharding separates very large databases into smaller datasets that are easier to manage, faster, and more efficient. The tuples that belong to a horizontal fragment are specified by a condition on one or more attributes of the relation; often only a single attribute is involved. Each resulting subset of tuples has a logical meaning that determines the node to which it is assigned.

Each horizontal fragment of a relation S can be specified in relational algebra by a select operation $\sigma_{C_i}(S)$. A set of horizontal fragments with conditions $C_1, C_2, \ldots, C_n$ that includes all the tuples in S, so that every tuple in S satisfies ($C_1$ OR $C_2$ OR … OR $C_n$), is called a complete horizontal fragmentation of S. In many cases a complete horizontal fragmentation is also disjoint; that is, no tuple in S satisfies ($C_i$ AND $C_j$) for any $i \neq j$.
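A small Python sketch may make horizontal fragmentation concrete. The relation `S`, its attributes, and the three conditions are hypothetical. The sketch takes "complete" to mean that every tuple satisfies at least one condition, and "disjoint" to mean that no tuple satisfies two conditions at once.

```python
# Hypothetical relation S: one dict per tuple (row).
S = [
    {"id": 1, "dept": "toys", "price": 20},
    {"id": 2, "dept": "home", "price": 75},
    {"id": 3, "dept": "toys", "price": 35},
    {"id": 4, "dept": "garden", "price": 50},
]

# One selection condition C_i per fragment.
conditions = {
    "toys": lambda t: t["dept"] == "toys",
    "home": lambda t: t["dept"] == "home",
    "garden": lambda t: t["dept"] == "garden",
}

def fragment(relation, conds):
    """Apply the selection sigma_{C_i}(S) for every condition C_i."""
    return {name: [t for t in relation if c(t)] for name, c in conds.items()}

def is_complete(relation, conds):
    """True if every tuple satisfies at least one condition."""
    return all(any(c(t) for c in conds.values()) for t in relation)

def is_disjoint(relation, conds):
    """True if no tuple satisfies two different conditions."""
    return all(sum(c(t) for c in conds.values()) <= 1 for t in relation)

fragments = fragment(S, conditions)
print({name: [t["id"] for t in rows] for name, rows in fragments.items()})
print(is_complete(S, conditions), is_disjoint(S, conditions))
```

Fragmenting on a single attribute such as `dept` gives a fragmentation that is both complete and disjoint, provided every row carries one of the listed department values.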

Disadvantages

Numerous problems associated with distributed database systems do not exist in a centralized database management system. One problem is dealing with multiple copies of the same data in multiple locations. The main task of the concurrency control method is to keep these copies consistent, especially if a site storing a copy fails. Failure of sites is a big issue: the distributed system should continue operating even though one site has failed, and when the site is brought back online, it must be updated with the most current information. The system must also be able to deal with communication failures. If a communication link becomes unavailable, partitioning may occur, so that sites within a partition can communicate with each other but not with sites outside it. This can prevent transactions from committing. The two-phase commit protocol is often used to ensure that a transaction either commits at all participating sites or at none. Deadlock can also occur among several sites, requiring distributed deadlock-handling techniques (Elmasri & Navathe, 2016).
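The two-phase commit protocol mentioned above can be sketched as follows. This is a deliberately simplified, single-process illustration: the `Participant` class and its `can_commit` flag are hypothetical, and a real implementation would also need logging, timeouts, and recovery.

```python
class Participant:
    """A site taking part in a distributed transaction (hypothetical)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Coordinator: commit everywhere only if every site votes yes."""
    votes = [p.prepare() for p in participants]   # phase 1: voting
    decision = all(votes)
    for p in participants:                        # phase 2: decision
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

If a single site votes no, for example because it has been cut off by a partition, the coordinator aborts the transaction at every site, so the copies never diverge.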

NoSQL

Non-relational data stores have existed since the 1960s, but only recently acquired the label NoSQL. They have attracted new interest because of their adaptability to Web 2.0 applications. NoSQL provides storage and retrieval of data by means other than the tabular relations common in traditional DBMS applications. NoSQL stands for Not Only SQL. It was developed with the realization that many systems need more than just traditional relational database management systems. Most NoSQL systems use a distributed database system to store and manipulate data. These systems focus on semi-structured data storage, high performance, scalability, and data replication, as opposed to more traditional structured systems.

NoSQL was developed to manage storage systems with billions of records, such as Google’s or Yahoo’s email systems. These systems have millions of users, each with thousands of emails stored. A structured database system may not be suitable because many SQL systems offer services, such as a powerful query language and concurrency control, that many of these applications do not need. A traditional structured system may also be too restrictive, requiring schemas that NoSQL systems generally do not. A more obvious problem concerns types of data that traditional database systems handle poorly, such as graphic images, photos, and videos.

NoSQL systems are built with distributed databases in mind. They stress high availability and rely heavily on data replication. Scalability is an important characteristic because the data in a NoSQL system tends to keep growing.

NoSQL tends to use horizontal scalability, expanding the number of distributed system nodes for more data storage and processing volume, all while the system is operational. Horizontal scaling requires techniques for distributing existing data among the new nodes. It is useful because many applications using NoSQL demand continuous availability, and horizontal scaling allows the system to grow while it is running. For these applications, downtime is not acceptable.

Sharding of files is used extensively. Many thousands of users concurrently access files stored in NoSQL applications. These datasets contain multiple data types, documents, or objects, which makes it impractical to store all the data on one node. Sharding distributes the data load to numerous nodes throughout the system. Sharding and replication together help to improve load balancing and data availability in the system.

NoSQL systems use two replication models: master-slave and master-master. In master-slave replication, one copy is the master; all writes go first to the master and are then propagated to the slaves. When reads are served from the slaves, this model provides eventual consistency, meaning the slave data will eventually match the master data.

Master-master replication allows reads and writes at any of the replicated sites but does not guarantee that all sites hold the same values at any given moment. A reconciliation method is used so that all replicas eventually converge to the same values.

Write operations can be a bit cumbersome since so many replicated copies need to be continuously updated. High performance is required, but serializable consistency is not so important.
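The master-slave model and its eventual consistency can be illustrated with a toy in-memory store. The class names and the explicit `sync()` step are assumptions made for the sketch; real systems propagate updates asynchronously over the network.

```python
class Replica:
    """One copy of the data, held as a plain dictionary."""

    def __init__(self):
        self.data = {}

class MasterSlaveStore:
    """Writes go to the master; slaves catch up when sync() runs."""

    def __init__(self, n_slaves=2):
        self.master = Replica()
        self.slaves = [Replica() for _ in range(n_slaves)]
        self.pending = []   # updates not yet pushed to the slaves

    def write(self, key, value):
        self.master.data[key] = value
        self.pending.append((key, value))

    def read_from_slave(self, index, key):
        # May return stale data (None here) until sync() has run.
        return self.slaves[index].data.get(key)

    def sync(self):
        """Push pending updates in order; afterwards slaves match the master."""
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending.clear()

store = MasterSlaveStore()
store.write("sku-1", 5)
stale = store.read_from_slave(0, "sku-1")   # slave has not caught up yet
store.sync()
fresh = store.read_from_slave(0, "sku-1")   # slave now matches the master
```

The gap between `stale` and `fresh` is the window in which a reader can see old data, which is the price paid for not blocking writes on every replica.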

NoSQL applications must find individual objects or records among hundreds of millions. Because most accesses are by object key, NoSQL systems achieve high-performance data access by applying hashing or range partitioning to those keys. The object key is similar to the concept of an object id. Hashing applies a hash function h(K) to the key K, and the value of h(K) determines the location of the object with key K.

Range partitioning determines the location from a range of key values. For example, location i would hold the objects whose key values K fall in the range $K_{i,\min} \le K \le K_{i,\max}$ (Elmasri & Navathe, 2016).
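Both placement techniques can be sketched briefly in Python. The node count, the choice of MD5 as h(K), and the range boundaries below are illustrative assumptions, not values taken from any particular NoSQL system.

```python
import bisect
import hashlib

def hash_partition(key, n_nodes):
    """Location = h(K) mod number of nodes; MD5 keeps the result stable across runs."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_nodes

# Hypothetical inclusive upper bounds (K_i,max) of each node's key range, sorted.
RANGE_UPPER_BOUNDS = ["g", "n", "t", "z"]

def range_partition(key, upper_bounds=RANGE_UPPER_BOUNDS):
    """Return index i of the first range whose upper bound is >= the key's first letter."""
    i = bisect.bisect_left(upper_bounds, key[0].lower())
    if i == len(upper_bounds):
        raise KeyError(f"{key!r} is outside every range")
    return i

print(hash_partition("sku-123", 4))
print(range_partition("apple"), range_partition("hat"), range_partition("zebra"))
```

Hashing spreads keys evenly but scatters neighboring keys across nodes, while range partitioning keeps neighboring keys together, which helps queries that scan a range of keys.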

Lucene ("Apache Lucene - Apache Lucene Core," 2018), Solr ("Apache Solr -," 2018), ElasticSearch ("Open Source Search & Analytics · Elasticsearch | Elastic," 2018)  and Unbxd ("Unbxd - Ecommerce Site Search & Product Recommendations Solutions," 2018)  use NoSQL as a base for their applications.

Hadoop

Apache Hadoop (Apache Hadoop, 2018) is a collection of open source software utilities that allows a network of computers to solve problems involving massive amounts of data and computation. There are many potential applications for Hadoop, especially in big data. Hadoop is open source software (OSS) maintained as a project of the Apache Software Foundation. It runs on a cluster of servers and enables the processing of large distributed datasets. It is designed to scale from a single server to thousands of machines, each offering local computation and storage.

Big data systems are being developed to handle storage, analysis, and mining of the huge amounts of data produced daily by many industries, including retail, insurance, manufacturing, government, and education. Big data technology draws on DDBS and on machine learning algorithms trained to make decisions in place of humans, providing quicker responses to queries and surfacing patterns that would otherwise go unnoticed.

Hadoop came about because of the need for an open source search engine for big data applications. It implements the MapReduce programming model and is made up of two main components: the MapReduce programming paradigm and the Hadoop Distributed File System (HDFS) (Verma & Singh, 2017).

MapReduce was designed to handle hundreds of special-purpose computations on large datasets, such as building inverted web indexes, web graphs, and statistics. MapReduce assumes an underlying data model that treats each object of interest as a unique key with an associated value or content, i.e., a key-value pair. A computation applies a map operation to each logical record, producing a set of intermediate key-value pairs, and then applies a reduce operation to all values that share the same key in order to combine the derived data. This model makes it easy to parallelize large computations and uses re-execution as the primary fault tolerance mechanism (Satish & P., 2016).

The map and reduce functions have the following general form: map(K1, V1), where (K1, V1) is a (key, value) pair, returns List[(K2, V2)], and reduce(K2, List[V2]) returns List[(K3, V3)] (Elmasri & Navathe, 2016).
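The general form above can be illustrated with the classic word count example. The sketch below runs in a single Python process and only mimics the map, shuffle, and reduce phases; it does not use Hadoop's API, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """map(K1, V1) -> List[(K2, V2)]: emit (word, 1) for every word."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_fn(word, counts):
    """reduce(K2, List[V2]) -> List[(K3, V3)]: sum the counts for one word."""
    return [(word, sum(counts))]

def map_reduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for key, value in records:                 # map phase
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)              # shuffle: group values by key
    results = []
    for k2 in sorted(groups):                  # reduce phase
        results.extend(reducer(k2, groups[k2]))
    return results

docs = [("d1", "red lamp"), ("d2", "Red clock lamp")]
counts = dict(map_reduce(docs, map_fn, reduce_fn))
print(counts)
```

Because each call to `map_fn` depends on one record only, and each call to `reduce_fn` on one key only, a framework is free to run those calls on different machines and to re-run any call that fails.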

HDFS is the file system component and runs on a cluster of commodity hardware. It is a UNIX-like file system that relaxes a few POSIX (Portable Operating System Interface) requirements to enable streaming access to file system data. It provides high-throughput access to large datasets.

Metadata is stored on a dedicated server called the NameNode, and application data is stored on servers called DataNodes.

The NameNode stores an image of the file system namespace made up of i-nodes and the list of blocks belonging to each file. A write-ahead commit log, called the journal, records any changes made to the image. All servers communicate using TCP-based protocols. File content is replicated on multiple DataNodes to make the data more durable, the same approach used in the Google File System. This replication increases reliability, multiplies data transfer bandwidth, and makes it more likely that computation can run close to the data it needs (Elmasri & Navathe, 2016).

DataNodes store blocks in the node’s native file system. The blocks generally contain unstructured data, which is information whose model is not well defined and which has no accompanying metadata for representation and reasoning; interpreting it relies on comprehension of natural language.

The NameNode directs clients to the DataNodes that hold the blocks they request. Each block replica on a DataNode consists of two files: one holding the data and one holding the block’s metadata. DataNodes confirm to the NameNode that they are operating by sending periodic heartbeats. They also send block reports, which list the block id, generation stamp, and length of each block replica they host.

Because block locations change over time, they are not part of the namespace image in the NameNode and must be obtained from the block reports. This information is used for scheduling by the MapReduce JobTracker and by the NameNode. The NameNode replies to heartbeats from the DataNodes with one of the following commands:

·       Replicate a block to another node

·       Remove a block replica

·       Reregister the node or shut it down

·       Send an immediate block report
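The interaction between block reports and the commands listed above can be sketched with a toy NameNode. This is a simplified model, not HDFS code: the class, its methods, and the assumed replication target of three copies per block are illustrative, and only the replicate and remove commands are modeled.

```python
REPLICATION_FACTOR = 3  # assumed target number of replicas per block

class NameNode:
    """Toy NameNode that tracks block locations from block reports."""

    def __init__(self):
        self.block_locations = {}  # block_id -> set of DataNode names

    def receive_block_report(self, datanode, block_ids):
        """Record which blocks a DataNode says it currently holds."""
        # Forget what this node reported earlier, then apply the new report.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def commands_for(self, datanode):
        """Commands returned in a heartbeat reply (simplified)."""
        commands = []
        for block_id, holders in sorted(self.block_locations.items()):
            if datanode not in holders:
                continue
            if len(holders) < REPLICATION_FACTOR:
                commands.append(("replicate", block_id))
            elif len(holders) > REPLICATION_FACTOR:
                commands.append(("remove_replica", block_id))
        return commands

nn = NameNode()
nn.receive_block_report("dn1", ["b1", "b2"])
nn.receive_block_report("dn2", ["b1"])
nn.receive_block_report("dn3", ["b1"])
print(nn.commands_for("dn1"))
```

In this run block `b1` already has three copies, so `dn1` is only asked to replicate `b2`, which exists on one node. A real NameNode would also choose the target nodes and take rack placement into account.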

Application to Web Search

Information Retrieval (IR)

The World Wide Web (WWW) has introduced the world to a huge volume of unstructured data. The internet has caused an explosion of data in the form of messages, documents, photos, diagrams, recordings, and videos, all stored in a variety of standard formats like HTML, XML, and numerous audio and video formats. IR deals with the problems of storing, indexing, and retrieving such information so that users can find what they want with relative ease. IR systems must cope with a volume of information that numbers in the trillions of items today and is growing rapidly every year. This document is just one small part of the mass of data that is added to data systems yearly around the world.

IR systems use a free-form type of search request. They do not expect the user to know the structured schema of the database or even where the data is located. The free-form style allows search by keyword queries, and the system interprets the meaning.

Databases use well-structured, well-defined formal query languages to retrieve information quickly and efficiently. IR systems regularly deal with ill-defined, vague, unstructured search queries. Databases have well-defined schemas; IR systems have no fixed data model. Databases return a well-defined table of results, whereas IR systems return a list of document ids.
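A minimal inverted index shows how an IR system can answer a free-form keyword query with a list of document ids. The product descriptions below are hypothetical, and real engines add tokenization, stemming, and ranking on top of this basic idea.

```python
from collections import defaultdict

# Hypothetical product descriptions keyed by document id.
DOCS = {
    1: "Hand painted porcelain collector plate",
    2: "Porcelain doll with hand sewn dress",
    3: "Solar powered garden lamp",
}

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Free-form keyword query: return ids of documents containing every term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

INDEX = build_index(DOCS)
print(search(INDEX, "porcelain hand"))
print(search(INDEX, "garden lamp"))
```

The user supplies only keywords, with no knowledge of a schema, and gets back document ids rather than a table of rows.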

IR has been around longer than databases, starting originally as a retrieval system for titles, authors, topics, and keywords in academic libraries as part of the library and information sciences. Today, these same search concepts are repurposed in the world of eCommerce.

Amazon, the largest eCommerce business in the world, currently sells around 230 billion products and shows a linear increase in business yearly, adding billions more products. Its use of IR allows it to deliver an exceptional user experience (UX), informed by a deep understanding of customer behavior derived from extensive A/B testing of its search tools (Fattal, 2014).

Bradford Exchange and Hammacher Schlemmer are currently upgrading their search to an IR intensive search tool that will allow them to use machine learning to help improve the search experience for their customers. The intent is to show an increase in converting a search into sales resulting in millions of dollars in increased revenue.

Search Engine

Search engines are website features which allow users to search using symbols, words, or numbers. Within eCommerce sites, these search engines match these inputs against values for properties and dimensions known as attributes; color, size, material for instance. Tracking data include sales or views of a product or customer feedback such as reviews (Aucamp, 2014).  

Apache Solr

Apache Solr is a scalable, high-performing, feature-rich open source search engine that is growing in acceptance today. Combine Solr with Hadoop, and you have a search engine capable of exploring big data systems, and it is all free. Solr is an Apache open source enterprise search platform used by LucidWorks, PolySpot, Unbxd, and many others; Elasticsearch is a separate search engine that, like Solr, is built on Apache Lucene. Companies like Unbxd offer user interfaces on top of the Solr search platform as a means of managing the search process. Solr is one of the most widely used open source search platforms in the world. It is built on top of Apache Lucene, an open source retrieval library, is written in Java, and is part of the Apache Lucene project. Its features include faceted search, a method that enhances traditional search with a faceted navigation system so that users can narrow down search results using multiple filters. Other features include NoSQL capabilities, rich document handling, real-time indexing, database integration, and dynamic clustering.
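As a rough illustration of faceted search, the sketch below builds a request URL for Solr's select handler using the standard q, fq, facet, and facet.field query parameters. The host, the port, the collection name "products", and the field names are assumptions; the code only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Assumptions: a local Solr instance and a collection named "products".
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"


def faceted_query_url(keywords, facet_fields, filters=None):
    """Build a Solr select URL with facet counts and optional filter queries."""
    params = [
        ("q", keywords),    # the user's keyword query
        ("facet", "true"),  # ask Solr to return facet counts
        ("rows", "10"),     # number of results to return
    ]
    # One facet.field parameter per attribute to be counted.
    params += [("facet.field", field) for field in facet_fields]
    # One fq (filter query) per facet value the user has already selected.
    params += [("fq", f"{field}:{value}") for field, value in (filters or {}).items()]
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(faceted_query_url("chair", ["color", "material"], {"color": "red"}))
```

Each filter the user selects becomes another fq parameter, which is how the result set narrows while the facet counts are recomputed for the remaining documents.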

Relevancy

Relevancy plays a major role in search for eCommerce sites: the closer the returned information is to what the customer requested, the closer the site is to closing the sale. Several factors affect search relevancy. These include:

·       Query analysis, which involves the frequency of terms, the click-to-buy ratio, and time spent on the page.

·       Content relevancy, which deals with product or document content.

·       Geographic relevancy, which involves user behavior and region-based search.

·       Time-based relevancy, which involves the time of day, the time of year, and the season.

·       Contextual relevancy, which involves the semantic meaning of the word.

·       Social relevancy, which is based on the promotional and recommendation influence of the user.

·       Personalization relevancy, which uses explicit or implicit user signals, such as logins or cookies, to help boost sales.

Queries are made up of entries made by the user. They can contain the symbols, words, and numbers that the user types into a search box. These entries are commonly referred to as query terms or keywords. Queries can be long, with four or more terms, or short, with fewer than four terms. Long queries are favored by many eCommerce sites because they indicate that the potential customer is close to deciding on what they seek. Short queries usually provide a broader return with many results, whereas long queries return narrower results that are closer to what the customer sought (Aucamp, 2014).

Example: Long vs. Short Queries

Query (Broad)    Short Query          Long Query
Chair            Dining room chair    Ikea Henriksdal Highback upholstered chair
Camera           Canon Camera         Canon SX50 Digital Camera
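The distinction between short and long queries can be expressed as a simple term count. This is a minimal sketch that assumes the four-term threshold described in the text and splits terms on whitespace; the function name is hypothetical.

```python
# Threshold taken from the text: four or more terms makes a query "long".
LONG_QUERY_MIN_TERMS = 4


def classify_query(query):
    """Label a query "long" or "short" by counting whitespace-separated terms."""
    terms = query.split()
    return "long" if len(terms) >= LONG_QUERY_MIN_TERMS else "short"


for example in ("Chair", "Dining room chair", "Canon SX50 Digital Camera"):
    print(example, "->", classify_query(example))
```

Under this rule, a broad one-word query such as Chair also counts as short; separating broad from short queries would require an additional threshold that the text does not specify.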

 

Search comprises two paths or actions, which can be taken either independently or in combination. In the first path, the user enters a keyword search query. In the second path, the user follows navigational links to browse or filter the results. The second path is referred to as search browse, also known as guided navigation or facet-driven dialog.
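The two paths can be sketched against a small in-memory product list. The products, the field names, and the brand "Acme" are hypothetical; a real site would run both steps inside the search engine rather than in application code.

```python
from collections import Counter

# Hypothetical products used only to illustrate the two search paths.
PRODUCTS = [
    {"name": "Dining room chair", "color": "red", "brand": "Ikea"},
    {"name": "Office chair", "color": "black", "brand": "Ikea"},
    {"name": "Highback chair", "color": "black", "brand": "Acme"},
    {"name": "Canon camera", "color": "black", "brand": "Canon"},
]


def keyword_search(keyword, products=PRODUCTS):
    """Path 1: keep the products whose name contains the keyword."""
    return [p for p in products if keyword.lower() in p["name"].lower()]


def facet_counts(results, field):
    """Count results per facet value, as displayed beside navigation links."""
    return Counter(p[field] for p in results)


def apply_facet(results, field, value):
    """Path 2: narrow the current results to one chosen facet value."""
    return [p for p in results if p[field] == value]


results = keyword_search("chair")
print(facet_counts(results, "color"))
print([p["name"] for p in apply_facet(results, "color", "black")])
```

Because apply_facet accepts any list of products, the second path also works without a keyword query, which matches the observation that the paths can be taken independently or in combination.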

What is gauged in a search query is the user’s intent: what is the user trying to accomplish, and what are they seeking? The system aims to measure that intent through the user’s online behavior when searching and clicking. Much of the purpose of machine learning in search is to gauge, correctly, what the user seeks from their behavior while searching and browsing the site. The understanding is that user behavior reveals their intent.

User intent comes in two types. The first is implied intent, which applies when the shopper is a novice and not sure of what they seek. Such shoppers use broader query terms, such as Chair in the table above. A more experienced, technically aware shopper uses more refined search terms, such as Canon Camera or Canon SX50 Digital Camera from the same table. These searches reflect the second type, explicit intent (Satish & P., 2016).

Conclusion

Search on any eCommerce site is an essential feature affecting key metrics such as user experience and conversions, and it offers insights into user behavior, the language users use when searching, and shopping habits, along with a wealth of information waiting for discovery. As shown in this research, there are many tools available for eCommerce sites to utilize. Many of these tools build on big data technologies such as NoSQL and Hadoop. Search products like Unbxd, Solr, Elasticsearch, and Lucidworks provide both a search engine and the interfaces with which to manage these various tools. Bradford is looking at interface tools that allow for controlling attribute and facet values. Bradford will also explore machine learning so that decisions concerning what the customer sees are made instantly by the machine; keep in mind that a human still needs to teach the machine the logic behind the decision. Search doesn’t need to be difficult, but thought must be put into proper integration and optimization to ensure delivery of the correct results.

 

 

References:

Apache Hadoop. (2018, December 23). Retrieved from https://hadoop.apache.org/

Apache Lucene - Apache Lucene Core. (2018, December 23). Retrieved from https://lucene.apache.org/core/

Apache Solr. (2018, December 23). Retrieved from http://lucene.apache.org/solr/

Elmasri, R., & Navathe, S. B. (2016). Fundamentals of database systems (7th ed.). Boston

Karambelkar, H. V. (2015). Scaling big data with Hadoop and Solr: Understand, design, build, and optimize your big data search engine with Hadoop and Apache Solr (2nd ed.). Birmingham, England: Packt Publishing.

Verma, N., & Singh, J. (2017). An intelligent approach to Big Data analytics for sustainable retail environment using Apriori-MapReduce framework. Industrial Management & Data Systems, 117(7), 1503–1520. https://doi.org/10.1108/IMDS-09-2016-0367

Fattal, T. (2014). Creation of an Analytics platform for an on-site eCommerce search engine.

Aucamp, J. (2014). eCommerce Site Search Engine - Best Practise. Unpublished. https://doi.org/10.13140/RG.2.1.2297.1042

Open Source Search & Analytics · Elasticsearch | Elastic. (2018, December 23). Retrieved from https://www.elastic.co/

Satish, R., & P., N. (2016). Trend Analysis of E-Commerce Data using Hadoop Ecosystem. International Journal of Computer Applications, 147(6), 1-5. doi:10.5120/ijca2016911109

Unbxd - Ecommerce Site Search & Product Recommendations Solutions. (2018, December 23). Retrieved from https://unbxd.com/