Space is not afraid of downtime. How did NASA, the originator of software engineering, do it?

To many people, NASA is mysterious. The National Aeronautics and Space Administration (NASA) is an agency of the U.S. federal government responsible for formulating and carrying out the U.S. civil space program and for conducting research in aeronautics and space science. It is responsible for U.S. space exploration, including the Apollo program to the Moon, the Skylab space station, and later the Space Shuttle.


It is particularly commendable that NASA has embraced open source enthusiastically. In the aerospace field, NASA is at the forefront of open source and has released a large amount of software and design tools covering the entire process of spacecraft development and operation. As is well known, no NASA space mission can do without advanced software. What most people may not know is that today's concept of "software engineering" originated at NASA, which shows how much software architecture matters there.


Glenn Fleishman has written a very readable account of how NASA's software architecture has evolved. The InfoQ Chinese site translated it as a reminder of the original aspiration: our journey is to the stars and the sea!


When you are in space, millions of miles from home, and the server crashes, no one can hear you panic! So for NASA, redundancy is particularly important. If it can send five servers beyond the stratosphere, it will certainly not send just one.


In space, cosmic rays, wear and tear, sudden failures, and other unpredictable factors all affect the equipment, so the software architecture must be very robust. David Garlan, a pioneer of the software architecture field and a distinguished visiting scientist at NASA's Jet Propulsion Laboratory, said that spacecraft systems in particular need a fault protection layer that lets them switch to an emergency mode without direct intervention from Earth.


Aerospace software is harder to design than software on the ground.


On the ground, if a data center's servers are too slow, the problem can be solved by adding more servers, but a spacecraft's shortage of computing power cannot be solved by adding machines. The computing power of a spacecraft stays constant throughout the mission, so the system must be designed to shed certain tasks automatically.


For a database server on the ground, it does little harm if data cannot be inserted in real time. A spacecraft flying to Mars, however, may miss Mars entirely if its timing is not precisely controlled.


Running continuously for 2,700 days depends on a self-recovery strategy.


The number of computers a spacecraft can carry is strictly limited, yet adding physical backups makes the spacecraft's software more resilient. For NASA's Space Shuttle program, which ran from 1972 to 2011, three or four computers were not enough. The shuttle carried five flight control computers, and designers deliberated repeatedly over each addition.


In addition, NASA developed a strategy that lets software recover in the worst case. It was first implemented midway through the Apollo program, which ran from 1961 to 1972, and was designed specifically for handling unexpected problems. Without it, one mission after another would have had to be abandoned.


During its launch in 1977, NASA's Voyager 2 space probe registered some jitter, which could easily have brought its computers down. Scientists had not been able to predict how the program would interpret this jitter, but the probe ultimately entered recovery mode correctly.


The Opportunity rover was launched in 2003. It was designed to operate on Mars for 90 days, but it kept working until mid-2018, a total of 5,111 days, and ended up serving for about 15 years. Partway through, its code was rewritten, based on the movement data it had collected, to work around a fault caused by a shorted cable.


Curiosity was launched in 2011. It was originally planned to operate on Mars for 687 days, but it is still in operation today and, as of this writing, has exceeded 2,700 days. In its first and fifth years on Mars, its two main computers malfunctioned, almost causing the mission to fail.


Sometimes, redundancy also creates opportunities.


Voyager 2 passed Uranus in 1986 and Neptune in 1989. Had Voyager 2 not carried a set of spare computers, to which the mission control center could upload new software, it would have taken far fewer pictures of Uranus and Neptune.


Only backups can beat Murphy's law


Whether to tolerate a few minutes of downtime or to have a failover solution is a trade-off. Computing hardware on Earth has evolved from a single mainframe into powerful redundant arrays of servers, in which one or more servers can fail without interrupting the business. Out of necessity, NASA likewise decided to have multiple computers in space perform duplicate functions.


Even with perfect software and perfect hardware, a crash in space is still possible: only backups can defeat Murphy's law and cosmic rays. During the Apollo program, NASA worked to test every component and system until it was judged to be perfect. However, this approach is both expensive and fragile.


   "Perfect" can't solve the problem


Two famous examples illustrate the problem with relying on perfection. In the late 1960s, the engineer Margaret Hamilton led the team at the Massachusetts Institute of Technology that developed the software for the Apollo program. She often took her young daughter Lauren to the office when she worked late at night and on weekends. The Apollo 8 mission, flown at the end of 1968, was to be the first crewed flight around the Moon. Before that flight, Lauren was playing with the command module simulator through the DSKY, the Apollo computer's combined keyboard and display, and unexpectedly triggered the pre-launch sequence, crashing the flight simulation.


Hamilton tried to persuade NASA to let her add error checking so that astronauts could not make the same mistake during a mission, however unlikely that was. NASA rejected her request, insisting that the astronauts would carry out the mission perfectly. As a last resort, Hamilton documented the issue in the manual.


Then, during the flight of Apollo 8, astronaut Jim Lovell happened to enter the same command. As a result, the navigation data needed to return to Earth was erased from the spacecraft's memory. It was a "Houston, we have a problem" moment, though less like Apollo 13's jump into a makeshift lifeboat and more like finding a backup tape: Hamilton and her team were able to transmit the navigation data from Earth, because the system was flexible enough to accept such input in flight.


Hamilton designed the system to be flexible, so that when it was overwhelmed it could resume normal operation without outside intervention, report the error, and provide enough information for people to make a judgment. During the Apollo 11 lunar landing, when the computer became overloaded, its load management software concentrated on the higher-priority tasks, including radar input, and performed as designed. After tense and rapid consultation at Mission Control, the astronauts received the go-ahead only seconds before the lunar lander's fuel ran out.
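
The idea of skipping low-priority work under overload can be sketched in a few lines of Python. This is only an illustration of priority-based load shedding, not the Apollo guidance software: the task names, priorities, costs, and the per-cycle budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    priority: int  # assumed convention: lower number = more important
    cost: int      # hypothetical units of processor time per cycle


def schedule_cycle(tasks, budget):
    """Run tasks in priority order; shed whatever no longer fits the budget."""
    run, shed = [], []
    remaining = budget
    for task in sorted(tasks, key=lambda t: t.priority):
        if task.cost <= remaining:
            run.append(task.name)
            remaining -= task.cost
        else:
            shed.append(task.name)  # skipped this cycle, reported instead of crashing
    return run, shed


tasks = [
    Task("display_update", priority=3, cost=30),
    Task("landing_guidance", priority=1, cost=50),
    Task("radar_input", priority=2, cost=40),
]
run, shed = schedule_cycle(tasks, budget=100)
print("run:", run)
print("shed:", shed)
```

With a budget of 100 units, the two most important tasks fit and the display update is dropped for that cycle, so the system degrades instead of failing outright.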


In 1994, Hamilton told Air & Space magazine: "Our software saved the mission because it was asynchronous and because it would skip low-priority tasks. Without it, the mission would have been aborted or would have crashed on the Moon."


Computers grew more and more powerful, and software engineering, a term Hamilton coined for her work on space flight, matured over the following years. As a result, the Space Shuttle's design did not rely on a primary system with one or even several backup systems. Instead, it relied on four independent computers that ran the same navigation and guidance software and received the same input data.


These four computers operated as a model of democracy. Three of them had to agree on what they were measuring before any action was taken. If three computers agreed and the fourth disagreed, the astronauts would shut down or restart the dissenting machine. This made possible the quick decisions needed to avoid disasters or costly pauses.


If several computers failed or no consensus could be reached, an additional computer with access to the shuttle's control systems could take over. It could carry out a pre-programmed, rough but safe ascent, abort, or reentry.
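
The voting scheme described above can be sketched as follows. This is a minimal illustration, assuming the four computers produce directly comparable outputs and that a quorum of three is required. The function names and command strings are hypothetical, and a real flight system would compare values within tolerances rather than for exact equality.

```python
from collections import Counter


def vote(outputs, quorum=3):
    """Return (agreed_value, dissenting_indices), or (None, []) without a quorum."""
    value, count = Counter(outputs).most_common(1)[0]
    if count < quorum:
        return None, []
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


def choose_command(outputs, backup_output):
    """Use the voted value; fall back to the backup computer if there is no consensus."""
    value, dissenters = vote(outputs)
    if value is None:
        return backup_output, "backup", dissenters
    return value, "primary", dissenters


# Three computers agree, the fourth dissents and is flagged for shutdown or restart.
print(choose_command(["pitch+2", "pitch+2", "pitch+2", "pitch-9"], "safe_ascent"))
# A two-against-two split gives no quorum, so the backup computer takes over.
print(choose_command(["pitch+2", "pitch+2", "pitch-9", "pitch-9"], "safe_ascent"))
```

The dissenting index is returned rather than acted on, which mirrors the text: the crew, not the voter, decides whether to shut down or restart the odd machine out.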


   Redundancy on top of redundancy


In 1964, Gary Flandro, a scientist at the Jet Propulsion Laboratory, calculated that by the end of the 1970s Jupiter, Saturn, Uranus, and Neptune would be aligned so that a probe could use gravity assists, also known as the gravitational slingshot effect, in which a spacecraft swings around a planet to gain speed and change course, to visit each of these planets in turn. This alignment occurs only once every 175 years. The Voyager probes were sent into space to take advantage of it.


Many aspects of the Voyager mission are redundant, starting with its two probes, which were launched separately. Voyager 1 and Voyager 2 each carry two parallel computer systems. If one of the three A computers (used for command, data management, and attitude control, respectively) fails or crashes, the probe can switch to the B system either automatically or on command.
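
The A/B arrangement can be illustrated with a small sketch. It assumes each function has exactly two sides and that a fault on the active side triggers a switch. The class and method names are hypothetical and do not reflect the actual Voyager flight software.

```python
class RedundantUnit:
    """One function (for example attitude control) with an A side and a B side."""

    def __init__(self, name):
        self.name = name
        self.active = "A"
        self.healthy = {"A": True, "B": True}

    def switch(self):
        """Swap to the other side; usable automatically or as a ground command."""
        other = "B" if self.active == "A" else "A"
        if not self.healthy[other]:
            raise RuntimeError(f"{self.name}: no healthy side left")
        self.active = other
        return other

    def report_fault(self, side):
        """Mark a side as failed and fail over if it was the active one."""
        self.healthy[side] = False
        if side == self.active:
            self.switch()


units = {n: RedundantUnit(n) for n in ("command", "data", "attitude")}
units["attitude"].report_fault("A")  # automatic failover to B
units["command"].switch()            # switch requested by command
print({name: unit.active for name, unit in units.items()})
```

Keeping the switch logic in one method means the automatic path and the commanded path behave identically, which is one plausible way to keep the redundancy management itself simple.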


The Voyager probes were also the first NASA probes to use software to detect faults: they can analyze various events and respond accordingly without instructions from the ground. As project manager John Casani pointed out in an interview in the book Voyager Tales, the rotation rate during launch was much faster than anything the spacecraft would experience in space. Voyager 2 hit exactly this unexpected situation at launch. Fortunately, it assessed the situation on its own and reset itself into safe mode.


Casani said that even before the ground staff could figure out what had happened, "the spacecraft recovered on its own after separating from the launch vehicle." The project staff then updated Voyager 1's software before its launch so that it could properly handle the rotation and shaking.


Today, the Voyagers have left the bubble of the Sun's influence, the so-called heliosphere, though not the solar system itself, and are now traveling through interstellar material. Even at such a distance, they can still send data back and still surprise the small team that continues to manage the mission. In 2010, Voyager 2 began to send back meaningless information instead of scientific data. In response, the scientists switched the probe to a standby mode, whose code had been refined over decades, and tracked down the problem. They then restored the program to its previous state, and the probe again sent data back to Earth at 160 bits per second.


  Redundancy can also cause problems


In space, systems are not easy to upgrade. For missions where hardware upgrades are impossible, redundancy is the right choice, but it can cause problems of its own. Professor Nancy G. Leveson of the Department of Aeronautics and Astronautics at the Massachusetts Institute of Technology wrote in an article titled "The Role of Software in Spacecraft Accidents":


    NASA conducted a study on an experimental aircraft equipped with two versions of the control system and found that all of the software problems that occurred during flight tests were caused by errors in the redundancy management system, not in the control software itself, which ran normally. We need to design protection for software that reflects the ways in which software actually fails.


Another important issue is that modern spacecraft systems are far more complex than their predecessors. With more computing power available and more tasks handled during a mission, the code that drives today's spacecraft has become extremely complex, and some errors are inevitable. The lack of a standard software architecture makes failures worse. Yet even the most advanced spacecraft in the current program rely to some extent on older software architecture concepts. The Orion spacecraft, designed to carry astronauts to the Moon on future crewed missions, will carry four computers, each with two processors that work in parallel and whose results must agree. The software on each computer behaves as if it were flying the spacecraft on its own.


These computers are not democrats but solipsists: each one considers itself "the only computer in the universe." If one computer fails to provide the right instructions at the right time, the system is designed to restart the failed computer and accept instructions from the next one. Like the Space Shuttle, Orion will also be equipped with a backup flight computer.
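
A rough sketch of this "solipsist" arrangement follows, assuming that a computer whose two processors disagree simply stays silent and that the system takes the first non-silent computer in a fixed order. The function names and command strings are hypothetical. This illustrates the idea described in the text, not Orion's actual flight software.

```python
def self_checking_output(lane_a, lane_b):
    """One flight computer: both processors must agree, otherwise it stays silent."""
    return lane_a if lane_a == lane_b else None


def select_command(computers):
    """Accept the command of the first computer, in order, that is not silent."""
    for index, (lane_a, lane_b) in enumerate(computers):
        output = self_checking_output(lane_a, lane_b)
        if output is not None:
            return index, output
    return None, None  # every computer is silent: hand over to the backup computer


# Computer 0 has an internal disagreement, so the system moves on to computer 1.
computers = [
    ("burn", "coast"),
    ("burn", "burn"),
    ("burn", "burn"),
    ("burn", "burn"),
]
print(select_command(computers))
```

Unlike the shuttle's vote, no computer ever compares itself with its peers here. Each one only checks itself, and the selection step just walks down the list.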


It has been more than 50 years since humans first landed on the Moon, Earth's only natural satellite, and a huge effort is still needed to return there safely.
