A data mining system using certain pre-existing algorithms to process raw data on South Florida companies to create usable data reports for user visualization and analysis.
THE PROBLEM
Venture Hive is a South Florida company incubator: it provides other businesses, usually start-ups, with the guidance and tools they need to grow and succeed. To incubate efficiently, Venture Hive collects large amounts of data about regional companies; however, it needs a system that mines and sifts the data to retrieve useful information. Frequent pattern mining leads to the discovery of associations among items in large relational data sets. Venture Hive could benefit from finding these association patterns in its databases, especially to assess the replicability of practices that are correlated with desirable results.
PROPOSED SOLUTION
Incorporating the Regional Miner into its operational structure marks Venture Hive’s move into new territory. The idea is twofold. First, Venture Hive identifies salient correlations between specific business attributes in order to make informed recommendations to its clients. Second, it sorts businesses and entrepreneurs into groups, or “clusters”, based on shared features, so that it can tailor its incubation approach to the cluster to which a client belongs.
SOLUTION'S FEATURES
- Allow the user to select filtered data and view its details.
- Display which algorithms require data adjustments in order to be executed.
- Allow the user to run data cleaning routines (e.g. replace missing values) on the selected data set for a specific algorithm.
- Allow the user to apply data transformations (e.g. discretize and normalize) that improve results for a given data set and specific algorithm.
- Allow the user to run multiple algorithms, regardless of type (i.e. association and clustering), in a single instruction.
- Allow the user to save a run configuration (i.e. data set, data adjustments and algorithms) for future execution.
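The cleaning and transformation features listed above can be illustrated with a minimal, standard-library-only sketch. It is not the system's implementation, which relies on Weka's pre-processing tools; the class name `Preprocessing`, its method names, the use of NaN to mark missing values, and the sample numbers are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical stand-ins for the cleaning and transformation routines.
// The deployed system delegates this work to Weka; this only shows the idea.
class Preprocessing {

    // Data cleaning: replace missing values (encoded here as NaN)
    // with the mean of the values that are present.
    static double[] replaceMissingWithMean(double[] column) {
        double sum = 0;
        int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        double mean = count == 0 ? 0.0 : sum / count;
        double[] cleaned = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            cleaned[i] = Double.isNaN(column[i]) ? mean : column[i];
        }
        return cleaned;
    }

    // Data transformation: min-max normalization into the range [0, 1].
    static double[] normalize(double[] column) {
        double min = Arrays.stream(column).min().orElse(0.0);
        double max = Arrays.stream(column).max().orElse(0.0);
        double range = max - min;
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = range == 0 ? 0.0 : (column[i] - min) / range;
        }
        return out;
    }

    // Data transformation: equal-width discretization into `bins` intervals.
    // Returns the index of the interval each value falls into.
    static int[] discretize(double[] column, int bins) {
        double min = Arrays.stream(column).min().orElse(0.0);
        double max = Arrays.stream(column).max().orElse(0.0);
        double width = (max - min) / bins;
        int[] out = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            int index = width == 0 ? 0 : (int) ((column[i] - min) / width);
            out[i] = Math.min(index, bins - 1); // the maximum belongs to the last bin
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up attribute column with one missing value.
        double[] column = {2, Double.NaN, 10, 6};
        double[] cleaned = replaceMissingWithMean(column);
        System.out.println(Arrays.toString(cleaned));
        System.out.println(Arrays.toString(normalize(cleaned)));
        System.out.println(Arrays.toString(discretize(cleaned, 2)));
    }
}
```

Cleaning runs before the transformations here because both normalization and discretization need a complete column to compute its minimum and maximum.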
SYSTEM DESIGN
The Regional Miner uses a three-tier architecture. The View Tier will run on the client’s computer, whose operating system can vary. The Middle Tier, containing the Weka application and the logic platform, will run on a dedicated Unix-based server and use the time-based job scheduler cron. The MySQL-implemented Database Tier will reside on a separate database server. The middle tier server and the database server will be connected via JDBC.
DATA MINING SUBSYSTEM IMPLEMENTATION
The Data Mining Subsystem allows the execution of a range of algorithm choices (i.e. association and clustering) in a single instruction, with each algorithm producing an individual result. Furthermore, the system indicates which algorithms require data adjustments and then provides the user with routines for data cleaning and data transformation. Additionally, every run configuration can be executed immediately or scheduled for future execution.
Our system employs the Weka 3.7 developer version, a suite of machine learning software written in Java (our system is also written in Java for compatibility), to offer the user a range of algorithm choices from two categories: association and clustering. The algorithms are run on data sets to generate results that are intelligible to the user; multiple algorithms can be run at once, with each algorithm producing an individual result. Because some data sets need to be refined before algorithms can be run on them, the system uses tools provided by Weka for data pre-processing. Whenever pre-processing is necessary, the user will be guided by system recommendations. All the user-initiated tasks mentioned above can be scheduled by the user for future execution.
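The idea of running several algorithms in a single instruction, each with its own result, can be sketched as follows. This is a simplified illustration rather than the project's code: `RunConfiguration`, `add`, and `runAll` are hypothetical names, and the two lambdas are placeholders standing in for the Weka association and clustering algorithms.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical run configuration: one data set plus any number of named algorithms.
class RunConfiguration {
    private final List<double[]> dataSet;
    // Insertion-ordered so results come back in the order the algorithms were added.
    private final Map<String, Function<List<double[]>, String>> algorithms = new LinkedHashMap<>();

    RunConfiguration(List<double[]> dataSet) {
        this.dataSet = dataSet;
    }

    // Registers an algorithm under a name; returns this so calls can be chained.
    RunConfiguration add(String name, Function<List<double[]>, String> algorithm) {
        algorithms.put(name, algorithm);
        return this;
    }

    // The "single instruction": runs every registered algorithm on the data set
    // and collects one individual result per algorithm.
    Map<String, String> runAll() {
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, Function<List<double[]>, String>> entry : algorithms.entrySet()) {
            results.put(entry.getKey(), entry.getValue().apply(dataSet));
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up rows; the placeholder lambdas only report what they received.
        List<double[]> rows = Arrays.asList(new double[]{1, 2}, new double[]{3, 4});
        Map<String, String> results = new RunConfiguration(rows)
                .add("association", data -> "association ran on " + data.size() + " rows")
                .add("clustering", data -> "clustering ran on " + data.size() + " rows")
                .runAll();
        results.forEach((name, result) -> System.out.println(name + ": " + result));
    }
}
```

Keeping the data set and the algorithm list together in one object is also what would make a configuration easy to save and schedule for later execution.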
SYSTEM VALIDATION
During this phase of the project, we performed manual system testing, which involved creating the testing scenarios, as well as unit testing and stress testing. As a result of this phase, many aspects of our implementation changed. For example, the data mining subsystem departed significantly from its initial design. Instead of keeping a specially cleaned and transformed data set for each algorithm in the configuration, only one data set is uploaded for all algorithms; filtering is configured before execution and applied at run time. This saves memory space, but makes processing slightly slower.
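The trade-off described above, a single shared data set with filters applied at run time instead of one pre-transformed copy per algorithm, can be shown in a small sketch. It assumes a numeric column and uses Java streams as a stand-in for the Weka filters; the class `RunTimeFiltering`, the method `run`, and the sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;
import java.util.function.ToDoubleFunction;
import java.util.stream.DoubleStream;

class RunTimeFiltering {

    // Applies the algorithm's filter lazily while the shared data is being read.
    // No filtered copy of the data set is stored, which saves memory; the cost is
    // that the filter is re-evaluated on every run, which takes extra time.
    static double run(double[] shared,
                      DoubleUnaryOperator filter,
                      ToDoubleFunction<DoubleStream> algorithm) {
        return algorithm.applyAsDouble(Arrays.stream(shared).map(filter));
    }

    public static void main(String[] args) {
        // One shared, made-up data set used by both placeholder "algorithms".
        double[] shared = {1, 2, 4};

        // First algorithm reads the raw values.
        double rawSum = run(shared, DoubleUnaryOperator.identity(), DoubleStream::sum);

        // Second algorithm needs values scaled by the maximum (4);
        // the scaling is set up front but only applied during the run.
        double scaledSum = run(shared, v -> v / 4.0, DoubleStream::sum);

        System.out.println(rawSum);
        System.out.println(scaledSum);
        System.out.println(Arrays.toString(shared)); // the shared data set is unchanged
    }
}
```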
SUMMARY
The Regional Miner helps Venture Hive incubate its clients in two main ways. First, it picks out correlations or associations between specific attributes; this allows Venture Hive to make recommendations to its clients based on what sorts of business practices are correlated with good results (e.g., number of founders being positively correlated with annual revenue). Second, it assembles groups of businesses or entrepreneurs based on key features that they have in common, no matter how apparent the features may be to a human observer. Different incubation methods will suit different companies, yet companies in the same cluster will likely benefit from the same method. Venture Hive can “profile” a client according to the cluster to which it belongs in order to reduce the range of appropriate incubation methods.
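To make the notion of an association between attributes concrete, the sketch below computes the two standard measures of an association rule, support and confidence, over a handful of records. It is illustrative only: the class `AssociationMetrics`, the attribute labels, and the four records are invented for the example and do not come from Venture Hive's data or from the system's Weka-based implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AssociationMetrics {

    // Small helper to build a set of attribute labels.
    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Support: fraction of records that contain every item in `items`.
    static double support(List<Set<String>> records, Set<String> items) {
        if (records.isEmpty()) {
            return 0.0;
        }
        long hits = records.stream().filter(r -> r.containsAll(items)).count();
        return (double) hits / records.size();
    }

    // Confidence of the rule antecedent => consequent:
    // support(antecedent and consequent together) / support(antecedent).
    static double confidence(List<Set<String>> records,
                             Set<String> antecedent,
                             Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        double antecedentSupport = support(records, antecedent);
        return antecedentSupport == 0 ? 0.0 : support(records, both) / antecedentSupport;
    }

    public static void main(String[] args) {
        // Four made-up company records, each a set of attribute labels.
        List<Set<String>> records = Arrays.asList(
                setOf("multiple_founders", "high_revenue"),
                setOf("multiple_founders"),
                setOf("single_founder", "high_revenue"),
                setOf("single_founder"));

        Set<String> antecedent = setOf("multiple_founders");
        Set<String> consequent = setOf("high_revenue");

        System.out.println("support = " + support(records, antecedent));
        System.out.println("confidence = " + confidence(records, antecedent, consequent));
    }
}
```

A rule with high support and high confidence is the kind of pattern that could back a recommendation, although a correlation found this way does not by itself show that one attribute causes the other.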
PROJECT SUMMARY
Owner – Venture Hive
Status – Deployed
Team – 3, including mentor
Time Length – 4 months
Methodology – Agile
PERSONAL CONTRIBUTION
Role – Team Leader
Responsibilities –
Backend Developer
Data Mining & Analytics
Database Manager
METHODS & TECHNIQUES
Interviews
Scenarios
Feasibility Study
Use Case
Static Modeling
Dynamic Modeling
System Testing
UML
TECHNOLOGIES
Java
Java Swing
JDBC
MySQL
Weka Data Mining Tool