A large part of software maintenance is source code refactoring: improving the structure and efficiency of code while leaving its external behaviour unchanged. Fowler introduced the concept of design smells to help identify what kind of refactoring is required and where it is needed. Design smells (also called Anti-Patterns) are defined as symptoms of poor solutions to recurring design problems. Due to the complexity of Machine Learning systems, Anti-Patterns are common. Some of the key causes of poor practice are:
- ML systems rely on data and therefore data privacy, security and data loss prevention are important considerations.
- ML systems require complex infrastructure and data engineering which is utilised by multiple teams.
- ML systems often break software engineering best practice and have been called the ‘high-interest credit card of technical debt’ by Sculley et al.
What is Production Grade Machine Learning?
For a data scientist, machine learning often means wrangling the data once, performing the necessary ETLT to gather the required data, and feeding it into Machine Learning models which are then iterated upon. The objective is usually to maximise a performance metric. Production grade machine learning additionally requires scalable components and streaming data, so that models can be continuously trained and the applications that depend on them can be monitored. A machine learning project has three main phases (development, deployment and operation), each with associated anti-patterns.
Single Point of Failure
Often a single person is tasked with implementing the full end-to-end machine learning process. This requires the skills of a Project Manager, Data Engineer, Machine Learning Engineer, Cloud Architect and Dev Ops Engineer. The chances of one person having the required depth of knowledge in each of these fields are fairly small. If an individual within your team does possess this skillset, it is advisable to build the rest of your organisation around them. This allows for knowledge transfer and, in turn, limits the challenges that arise if the person leaves. In an ideal world the team will consist of:
- Data Engineer — Data wrangling and building ETLT process
- Machine Learning Engineer or Data Scientist — Building the models
- Dev Ops Engineer — Building Scalable infrastructure
- Project Manager — Ensuring each member of the team is on target and liaising with decision-makers
- Decision Makers — Clients and product owners who define the scope
Siloed Working Practices
At the opposite end of the spectrum, working solely within an area of expertise leads to a lack of understanding of the full end-to-end process. Machine Learning models are often complex, and if a Dev Ops Engineer deploys a model without fully understanding it, further refinement and debugging can become very difficult. The project manager should be mindful of this and ensure all members have a clear picture and understanding of the project.
There is often a trade-off between the speed of development needed to meet deadlines and ship new services, and the quality of the code. Ward Cunningham introduced the technical debt metaphor in 1992 to describe the cost of choosing speed. As with financial debt, not all technical debt is bad, as there are often good reasons to take some on, but it must be accounted for, otherwise the “interest” can become unmanageable. As Sculley et al. note, “it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning”. Machine learning has all the common pitfalls of any other form of code, but the degree of system complexity can create hidden debt which mounts quickly.
As noted above, machine learning breaks traditional software best practices, most notably abstraction. Due to the reliance on external data, the logic cannot be separated from the data, so any irregularities in the data can have a large impact. This is known as the CACE (Changing Anything Changes Everything) principle. For example, if the distribution of one feature changes, the importance and weights of all the other features may change too. Because of this principle, initial development and deployment are often straightforward, but changing the model can be time-consuming.
By their nature, machine learning processes include feedback loops that continuously learn from new data, which modifies their behaviour. This leads to a form of technical debt in which it is difficult to predict the behaviour of a given model before it is released. These feedback loops can take different forms, but they are all more difficult to detect and address in the case of batch-updated models, as there is a time delay between the problem occurring and its observation.
An example is the hidden feedback loop in predictive policing. Several researchers have shown that Machine Learning algorithms can be applied to predict patterns in crime, and initial results have shown that these algorithms work exceptionally well. However, when an algorithm shows where crimes are likely to happen, the police increase patrols in those areas in an effort to fight crime, and the resulting data feeds back into the algorithm.
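The hidden feedback loop described above can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions, not a model of any real policing system: the two districts, their crime rates, the 80/20 patrol split and the function name `simulate_patrol_feedback` are all hypothetical. Both districts have the same true crime rate, but crime is only recorded where patrols are present, and patrols are sent where the records say crime is highest.

```python
def simulate_patrol_feedback(true_rates, initial_records, rounds=20,
                             patrols=100, hotspot_share=0.8):
    """Toy model of a hidden feedback loop (all numbers are hypothetical).

    true_rates: crimes recorded per patrol in each district.
    initial_records: recorded crime per district before the model is used.
    Expects at least two districts.
    """
    records = list(initial_records)
    n = len(records)
    for _ in range(rounds):
        # The "model" predicts the hotspot from recorded crime only.
        hotspot = max(range(n), key=lambda i: records[i])
        for i in range(n):
            if i == hotspot:
                allocated = patrols * hotspot_share
            else:
                allocated = patrols * (1 - hotspot_share) / (n - 1)
            # More patrols record more crime, which feeds the next prediction.
            records[i] += true_rates[i] * allocated
    return records


# Two districts with identical true crime rates and almost identical records.
final = simulate_patrol_feedback(true_rates=[0.5, 0.5], initial_records=[11, 10])
share = final[0] / sum(final)
print(f"District 0 share of recorded crime: {share:.0%}")
```

Although the districts are identical, district 0 starts with one extra recorded crime, is predicted as the hotspot in every round, and its share of recorded crime drifts towards the patrol split. The records then appear to confirm the prediction, which is what makes the loop hard to spot.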
ML Lifecycle Management Anti-Pattern
This anti-pattern occurs when the model or architecture used to train a model differs from the one deployed in production. Although counterintuitive, this happens in many businesses, usually when a team develops a model using technology they are familiar with, or when the scope of a project is not properly defined. The model works well and produces predictions which can be validated with a high degree of success. All looking good so far. However, problems occur when deploying to production: the model cannot process real-time information or handle the volume of information required. The model cannot be adapted to work in the production environment, so the team redevelops it using a different technology to cope. Things stop looking so good. The typical analytics process is a closed loop of re-training, re-validation and improvement of existing models. Meeting fast-moving targets needs an agile workflow which this closed loop can feed directly; with the anti-pattern, the work has to be duplicated in multiple technologies.
Closed Analytics Loop
The diagram below shows a typical machine learning workflow without the anti-pattern problem. The same model is used in the offline training environment and the online application environment. This is the ideal situation as there is no need for redevelopment. Due to the number of analytical tools available this can be difficult to achieve. At every step of the process there are numerous options which each have different advantages and are equally useful in different situations:
- Programming languages such as R or Python
- Open-source frameworks such as Apache Spark ML, H2O.ai or TensorFlow
- Commercial tools such as SAS or MATLAB
Different user groups will have different preferences: statisticians, in general, prefer R, while scientists tend to prefer Python. The most important step to avoid this anti-pattern is to plan as a team before any work begins. Business analysts, data scientists and developers should gather requirements and write a joint business case, bearing in mind:
- Real-Time response — How to adapt for new proactive actions in real-time?
- Scalability — What is the maximum load expected? Is there a chance this will increase?
- Velocity — How many events do you have to process? (Predictions based upon IoT observations or sensor logs will have many more data points than sales figures).
- Refinements — How easily can the model be changed, improved, and re-deployed? Most models will follow the CACE principle.
- Think about how to allow a closed-loop — maybe even an automated one.
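The Refinements point refers to the CACE principle. A small worked sketch of it follows, under stated assumptions: the data is synthetic, the model is a two-feature least-squares fit without an intercept, and `fit_two_feature_weights` and `noisy` are hypothetical helpers. Both features are noisy readings of the same hidden signal. Only the noise on the second feature is changed, yet the weight of the first feature changes too.

```python
def fit_two_feature_weights(x1, x2, y):
    """Least-squares weights for y ~ w1*x1 + w2*x2 (no intercept),
    solved from the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * t for a, t in zip(x1, y))
    s2y = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("features are collinear; weights are not unique")
    w1 = (s1y * s22 - s2y * s12) / det
    w2 = (s2y * s11 - s1y * s12) / det
    return w1, w2


# Synthetic data: the target is a hidden signal, each feature is signal + noise.
signal = [1, 1, -1, -1]
noise_1 = [1, -1, 1, -1]
noise_2 = [1, -1, -1, 1]


def noisy(noise, scale):
    """Return the signal with scaled noise added."""
    return [s + scale * e for s, e in zip(signal, noise)]


before = fit_two_feature_weights(noisy(noise_1, 1), noisy(noise_2, 1), signal)
# Change only the distribution of the second feature (noisier readings).
after = fit_two_feature_weights(noisy(noise_1, 1), noisy(noise_2, 2), signal)

print("weights before:", before)
print("weights after: ", after)
```

The first feature was not touched, but its weight moves from 1/3 to 4/9 because the fit compensates for the less reliable second feature. Any downstream logic tuned to the old weights would need to be revisited.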
As Martin Zinkevich states, “Be aware that identical short-term behaviour does not imply identical long-term behaviour”. This is related to training-serving skew: a difference between performance during training and performance in production (serving). It can be caused in a number of ways, the most common being:
- Differences in how you handle data in the training and production workflow
- Different data being used in training from the data seen in production
- A feedback loop (see above for hidden feedback loops)
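One way to detect the second cause, different data in training and production, is to compare simple feature statistics between the two. The sketch below is a minimal illustration using only the Python standard library. The function name `skew_report`, the threshold of 0.5 standard deviations and the sample values are assumptions chosen for illustration; a production system would likely compare full distributions for every feature.

```python
from statistics import mean, pstdev


def skew_report(training_values, serving_values, threshold=0.5):
    """Compare the serving mean of one feature with its training mean.

    The shift is measured in training standard deviations; the threshold
    is an arbitrary illustrative choice.
    """
    train_mean = mean(training_values)
    train_std = pstdev(training_values)
    shift = abs(mean(serving_values) - train_mean)
    if train_std > 0:
        score = shift / train_std
    else:
        # Constant training feature: any change at all counts as skew.
        score = 0.0 if shift == 0 else float("inf")
    return {"shift_in_std": score, "skewed": score > threshold}


training_sample = [10, 12, 11, 13, 9]     # hypothetical training values
serving_sample = [20, 22, 21, 23, 19]     # hypothetical production values
print(skew_report(training_sample, training_sample))
print(skew_report(training_sample, serving_sample))
```

Running a check like this on a schedule gives an early warning that the data reaching the model no longer resembles the data it was trained on.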
As with feedback loops, the best solution is to monitor the system so that any issues are detected quickly. For example, consider a system that calculates the probability of a click on a document for a given query, based upon doc_id and query_id. Initial testing shows its behaviour is very similar to the current system. Upon deployment, however, it is noticed that no new documents are being shown: because the system only shows a document based on its own click history, there is no way for it to learn that a new document should be shown. To resolve this, the system would need to train with live data.
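The example above can be sketched in a few lines. This is a toy illustration, not the system Zinkevich describes: the function name `rank_by_click_history`, the document and query identifiers and the click counts are all hypothetical. Documents are ranked purely by their recorded click rate for a query, so a document with no history scores zero and is never shown.

```python
def rank_by_click_history(history, candidates, query_id, top_k=2):
    """Rank documents by historical click rate for (doc_id, query_id)."""
    def click_rate(doc_id):
        clicks, impressions = history.get((doc_id, query_id), (0, 0))
        return clicks / impressions if impressions else 0.0
    return sorted(candidates, key=click_rate, reverse=True)[:top_k]


# Hypothetical history: (doc_id, query_id) -> (clicks, impressions)
history = {("doc_a", "q1"): (30, 100), ("doc_b", "q1"): (10, 100)}
candidates = ["doc_a", "doc_b", "doc_new"]

for _ in range(50):
    shown = rank_by_click_history(history, candidates, "q1")
    for doc_id in shown:
        clicks, impressions = history.get((doc_id, "q1"), (0, 0))
        # Only documents that are shown can gain history.
        history[(doc_id, "q1")] = (clicks, impressions + 1)

print(shown)                          # the last ranking served
print(("doc_new", "q1") in history)   # has the new document ever been shown?
```

In short-term testing this ranking looks the same as any system that favours the two established documents. The difference only appears over time, when the new document never accumulates the history it would need to be ranked.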
How to avoid Anti-Patterns
1. Have a clear objective
2. Communicate within the team
3. Plan to iterate, and build to allow this
4. Test on a different dataset to the training dataset
5. Monitor the system at differing intervals
6. Avoid black box solutions; there should be an understanding of each step
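Point 4 can be sketched with a simple hold-out split. This is a minimal standard-library illustration: the function name `train_test_split_indices`, the 20% test fraction and the fixed seed are assumptions, and libraries such as scikit-learn provide equivalent utilities.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into disjoint train and test sets."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # fixed seed for reproducibility
    n_test = max(1, round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(100)
# The model is fitted on train_idx only and evaluated on test_idx only.
print(len(train_idx), len(test_idx))
```

Keeping the two index sets disjoint means the reported metric reflects rows the model has never seen.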
Due to the complexity of Machine Learning, Anti-Patterns are common, and the technical debt of ML projects can quickly add up. With careful planning, these can be avoided and projects can be completed successfully, on time and within budget.