Ph.D. Research


My research work falls into the following broad areas. Please look at the publications page for recent accomplishments.

RESEARCH INTERESTS
Databases, Data Streams, Distributed Systems, Real-time Event Processing, Click Fraud, Cyber Security, Information Privacy, Data Mining, Query Language Design, Performance Measurement, and Social Networks.

SUMMARY OF SELECTED RESEARCH PROJECTS

  1. Real-time Scheduling of Queries in Data Stream Environments
  2. OM: A Framework for Continuous Query Optimization
  3. Click Fraud Detection
  4. Private Matching of Sensitive Data
  5. Performance Analysis of Web and Information Services
  6. Tools and Benchmarks for Lightweight Directory Access Protocol (LDAP) Based Services

  1. Real-time Scheduling of Queries in Data Stream Environments

    There has been considerable recent interest in building Data Stream Management Systems. Most such work addresses the issues of processing efficiency or memory utilization. However, very little work addresses the processing of continuous queries in real-time environments. Real-time processing of continuous queries is essential in numerous applications. For example, production management systems may require problem diagnosis to be done within a few seconds, and vehicular traffic management systems may require traffic statistics to be computed or corrective actions to be taken before the information becomes stale. We refer to such systems as Real-time Data Stream Management Systems (RDSMS). Work exists on real-time systems for relational data, but these techniques are mostly unsuited to data stream environments.

    The RDSMS is responsible for executing queries over recently arrived data. The volume of data to be processed per query execution is called the data load, and is monitored and controlled by the data load manager. The number of queries present in the RDSMS is called the query load. The query load manager admits new queries and discards obsolete queries. The performance monitor tracks system performance, such as changes in load conditions and the frequency of missed deadlines. In our work, we introduce a new component, called the load-adaptive feedback control, whose job is to provide timely feedback to control the data load and adjust the query schedules.
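
    As a rough illustration of the feedback idea, the sketch below adjusts the data load from the deadline-miss ratio reported by a performance monitor. It is a simple proportional controller written for this summary, not the control law used in our work; the class name, the target miss ratio, the gain, and the load bounds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoadAdaptiveController:
    """Hypothetical proportional controller for the data load (illustrative only)."""

    target_miss_ratio: float = 0.05  # assumed tolerable fraction of missed deadlines
    gain: float = 0.5                # assumed controller gain
    min_load: int = 10               # assumed lower bound on tuples per execution
    max_load: int = 10_000           # assumed upper bound on tuples per execution

    def next_data_load(self, current_load: int, missed: int, executed: int) -> int:
        """Return the data load to use in the next monitoring period."""
        if executed == 0:
            return current_load      # no feedback yet, keep the load unchanged
        miss_ratio = missed / executed
        error = self.target_miss_ratio - miss_ratio
        # Positive error means headroom (admit more data per execution);
        # negative error means too many missed deadlines (shed data load).
        proposed = round(current_load * (1.0 + self.gain * error))
        return max(self.min_load, min(self.max_load, proposed))


if __name__ == "__main__":
    controller = LoadAdaptiveController()
    load = 1000
    # (missed deadlines, executed queries) as a performance monitor might report them
    for missed, executed in [(40, 100), (20, 100), (5, 100), (0, 100)]:
        load = controller.next_data_load(load, missed, executed)
        print(f"missed={missed}/{executed} -> data load {load}")
```

    The load shrinks while deadlines are being missed and grows again once the miss ratio falls below the target. Adjusting the query schedules, the other half of the feedback loop, is not modeled here.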

  2. OM: A Framework for Continuous Query Optimization

    A sequence of data elements that is continuous, unbounded, and time-varying is called a data stream. Examples include data transmitted by sensors, stock market data, and network monitoring data. Users issue a continuous query (CQ) over data streams, and when new data arrives, the Data Stream Management System (DSMS) executes the CQ and returns the result. The same query is re-executed whenever new data elements arrive in the system. This query processing model is significantly different from that of relational database management systems (RDBMS). A DSMS is also likely to be used in environments where multiple users and queries exist, for example, online bidding systems such as eBay and real-time financial search engines such as TraderBot.

    Continuous queries are an important class of queries in data stream applications. While much work has been done on algorithms for processing CQs, less attention has been paid to the issue of optimizing such queries. In this work, we argue that parameters such as output rate and main memory utilization are more important cost objectives for CQ performance than disk I/O. We propose a novel framework, called OM, to optimize the memory utilization and output rate of CQs. Our technique monitors input stream and query characteristics, and switches plans only at certain boundary conditions. This approach is tunable to application requirements and enables the user to trade off query performance against optimization overhead.
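
    The idea of switching plans only at boundary conditions can be sketched as follows. This is a toy illustration, not the OM cost model: it assumes a single monitored characteristic (the stream arrival rate), two hypothetical plans, and two hypothetical boundaries that form a hysteresis band so that small fluctuations do not trigger re-optimization.

```python
class PlanSwitcher:
    """Hypothetical plan selector that switches only when a boundary is crossed."""

    def __init__(self, low_rate: float, high_rate: float):
        if low_rate >= high_rate:
            raise ValueError("low_rate must be below high_rate")
        self.low_rate = low_rate      # assumed boundary for returning to the first plan
        self.high_rate = high_rate    # assumed boundary for leaving the first plan
        self.plan = "max_output_rate" # assumed plan that favors output rate
        self.switches = 0             # counts how often optimization overhead is paid

    def observe(self, arrival_rate: float) -> str:
        """Feed one monitored arrival rate and return the plan to run next."""
        if self.plan == "max_output_rate" and arrival_rate > self.high_rate:
            self.plan = "min_memory"  # assumed plan that favors memory utilization
            self.switches += 1
        elif self.plan == "min_memory" and arrival_rate < self.low_rate:
            self.plan = "max_output_rate"
            self.switches += 1
        return self.plan


if __name__ == "__main__":
    switcher = PlanSwitcher(low_rate=400.0, high_rate=500.0)
    for rate in [100, 450, 550, 480, 520, 390, 300]:
        print(rate, switcher.observe(rate))
    print("plan switches:", switcher.switches)
```

    In this sketch, widening the gap between the two boundaries reduces the number of plan switches (less optimization overhead) at the cost of running a less suitable plan for longer, which is one way to picture the tunable trade-off described above.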

  3. Click Fraud Detection

    Online information and data services are a growing industry. Stock exchanges, news services, and online vendors such as Yahoo! already market stock quotes, news, and music, respectively, on the Internet. Roles are also becoming specialized. Publishers may have data domain expertise, but may not be able to disseminate data or manage clients efficiently. Therefore, an ancillary industry of data brokers has developed in parallel with the content creation industry. Brokers may maintain servers to enhance data delivery quality, manage subscriptions, provide anonymity guarantees, and support different payment options for clients and publishers. Examples of such brokers or intermediaries include Akamai, which provides enhanced data dissemination features.

    Current systems typically require publishers to trust brokers to behave honestly, though such trust may not always be warranted. We do not assume that brokers are honest, and propose methods to detect broker dishonesty.

    Click inflation, a topic of current interest, can be caused by broker dishonesty or neglect, with reports suggesting that up to 20% of reported clicks may be fraudulent. Major players such as Yahoo and Google have already been settling significant allegations of click fraud. As the content brokerage industry grows, so will the need for security protocols to guard against broker dishonesty. Work exists on pricing techniques in this domain, but such work tends to assume honest brokers and clients. This assumption is becoming increasingly untenable. We propose schemes to alert publishers to broker dishonesty.

  4. Private Matching of Sensitive Data

    Many companies wish to share their business data with other companies, usually with the goal of improving their marketing strategy or performing joint business exercises. For example, two companies might wish to conduct a joint marketing survey to determine their common customers. This requires finding common entries in two databases. Traditionally, this is done by executing a distributed join operation over the databases. However, the companies may not wish to reveal any private information, for example, information about customers that is found in only one database. This problem is known as private matching, and has generated significant research interest.

    Private matching is challenging because any solution should ensure both of the following: (1) neither party learns more than its own data and the answer, and (2) if one party learns the results of the match, both parties should learn them. Companies are likely to have security concerns, because private matching is prone to spoofing and guessing attacks. In a guessing attack, one company may act dishonestly and try to guess customers in the other database, with the goal of discovering private information about those customers. In a spoofing attack, a company might simply lie about the existence of some customers and try to discover additional information present in the other database. This problem has implications beyond the business domain. Consider a government agency that needs to consult a private database, for example a database of health records. Privacy concerns on the part of the government agency may require it to access the health records without revealing the query. Conversely, the owner of the database may want to release information on a 'need-to-know' basis, and guard against releasing any extra information. We are currently investigating private matching techniques that are resistant to spoofing and guessing attacks.
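
    To make the guessing attack concrete, the sketch below uses a deliberately naive matching scheme in which each party publishes unsalted hash digests of its customer identifiers. This scheme is an assumption introduced only for illustration; it is not one of the techniques under investigation, and the identifiers are made up. Because the matching step cannot tell genuine entries from guesses, a dishonest party can test identifiers it does not actually hold.

```python
import hashlib


def digest(customer_id: str) -> str:
    """Unsalted SHA-256 digest of an identifier (the naive, assumed scheme)."""
    return hashlib.sha256(customer_id.encode("utf-8")).hexdigest()


def match(claimed_ids, other_digests):
    """Return the claimed identifiers whose digests appear in the other party's set.

    Nothing here checks that the caller really owns the claimed identifiers,
    which is exactly what a guessing party exploits.
    """
    return {cid for cid in claimed_ids if digest(cid) in other_digests}


if __name__ == "__main__":
    # Company B publishes digests of its (made-up) customer list.
    company_b = {"alice@example.com", "bob@example.com"}
    b_digests = {digest(cid) for cid in company_b}

    # Honest use: company A submits only its real customers.
    company_a = {"alice@example.com", "carol@example.com"}
    print("common customers:", match(company_a, b_digests))

    # Dishonest use: company A submits guesses instead of real customers.
    guesses = {"bob@example.com", "dave@example.com"}
    print("leaked by guessing:", match(guesses, b_digests))
```

    The second call reveals a customer that appears only in the other database, which violates requirement (1) above.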

  5. Performance Analysis of Web and Information Services

    Over the years, the Internet and its usage have grown very rapidly. For example, large amounts of information are now stored on web servers or file servers instead of in books. At the click of a button, almost any kind of information can be searched, downloaded, or viewed online. In general, such client-server environments abound on the Internet, wherein users connect to a server and access the information available there. Typical information dissemination sites such as news agencies, digital libraries, and data set repositories are a small set of examples of services available over the Internet. As reported in recent statistical studies, Internet usage is growing at an alarming rate and web services are becoming more dynamic. Unfortunately, poor planning regarding the expected load or performance can prove disastrous for the web service provider. Hence, there is a need to simulate and study the performance of web servers, or of any information-disseminating client-server environment.

    In this project, we simulate the performance of a typical web service. The basic idea is to model the web server, the Internet, and the clients accessing the web service as an open queuing network. Our contribution lies in using a readily available simulation toolkit and in developing an extensible framework to simulate and study the model. We have developed the simulation framework using CSIM, which is a process-oriented, general-purpose simulation toolkit that can be integrated into C/C++ programs.
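
    For readers unfamiliar with open queuing networks, here is a minimal sketch in Python (it does not use CSIM and is not our framework). It assumes the simplest possible model: requests arrive as a Poisson process and pass through a tandem of two FIFO single-server stations with exponential service times, standing in for the network link and the web server. All rates are made-up parameters.

```python
import random


def simulate_tandem(arrival_rate, service_rates, num_requests, seed=0):
    """Mean end-to-end response time of a tandem of FIFO single-server stations."""
    rng = random.Random(seed)
    last_departure = [0.0] * len(service_rates)  # when each station last became free
    clock = 0.0
    total_response = 0.0
    for _ in range(num_requests):
        clock += rng.expovariate(arrival_rate)   # next external (Poisson) arrival
        t = clock
        for i, mu in enumerate(service_rates):
            start = max(t, last_departure[i])    # wait if the station is still busy
            t = start + rng.expovariate(mu)      # exponential service time
            last_departure[i] = t
        total_response += t - clock
    return total_response / num_requests


def analytic_mean_response(arrival_rate, service_rates):
    """Closed form for the same model: sum of 1 / (mu - lambda) over the stations."""
    if any(mu <= arrival_rate for mu in service_rates):
        raise ValueError("every station needs a service rate above the arrival rate")
    return sum(1.0 / (mu - arrival_rate) for mu in service_rates)


if __name__ == "__main__":
    lam = 50.0             # assumed requests per second
    rates = [200.0, 80.0]  # assumed service rates: network link, web server
    print("simulated:", simulate_tandem(lam, rates, num_requests=200_000))
    print("analytic: ", analytic_mean_response(lam, rates))
```

    Under these assumptions the simulated mean response time should be close to the closed-form value, which gives a quick sanity check before moving to models that have no closed form.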

  6. Tools and Benchmarks for Lightweight Directory Access Protocol (LDAP) Based Services

    LDAP-based directory servers are widely used in Grid computing, directory-enabled networking, and operating system support. Hence, benchmarking and analyzing the performance of a directory server is crucial. In this project, we analyze the performance of openldap-2.0.23, an open source LDAP directory server available from the OpenLDAP Foundation. We measure the throughput and latency of LDAP queries using a stress test, in which the server is heavily loaded by numerous clients accessing the service simultaneously. In other experiments, we focus on analyzing the update performance and the effect of indexing on the performance of LDAP queries. Our results provide interesting insights into identifying system bottlenecks and tuning the server performance.
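
    The shape of such a stress test can be sketched as follows. This is not our benchmark tool: the query is a stub that merely sleeps, and in a real run it would be replaced by a search issued against the directory server through an LDAP client library. The function names and the client and query counts are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def stub_query():
    """Stand-in for one LDAP search; a real test would contact the server here."""
    time.sleep(0.001)


def stress_test(query, num_clients, queries_per_client):
    """Run concurrent clients and report throughput and latency figures."""

    def client():
        latencies = []
        for _ in range(queries_per_client):
            start = time.perf_counter()
            query()
            latencies.append(time.perf_counter() - start)
        return latencies

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        futures = [pool.submit(client) for _ in range(num_clients)]
        latencies = [lat for f in futures for lat in f.result()]
    elapsed = time.perf_counter() - wall_start
    return {
        "queries": len(latencies),
        "throughput_qps": len(latencies) / elapsed,  # completed queries per second
        "mean_latency_s": mean(latencies),
        "max_latency_s": max(latencies),
    }


if __name__ == "__main__":
    for clients in (1, 8, 32):
        print(clients, stress_test(stub_query, clients, queries_per_client=50))
```

    Increasing the number of simulated clients until throughput stops rising while latency keeps growing is one simple way to locate the saturation point of the server.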
